Abstract

Hardness  results  for  maximum  agreement  problems  have 
close  connections  to  hardness  results  for  proper  learning 
in  computational  learning  theory.  In  this  paper  we  prove 
two  hardness  results  for  the  problem  of  finding  a  low 
degree  polynomial  threshold  function  (PTF)  which  has 
the  maximum  possible  agreement  with  a  given  set  of 
labeled examples in $\mathbb{R}^n \times \{-1,1\}$. We prove that for any
constants $d \ge 1$, $\epsilon > 0$,

•  Assuming the Unique Games Conjecture, no polynomial-time algorithm can find a degree-$d$ PTF that is consistent with a $(\frac{1}{2} + \epsilon)$ fraction of a given set of labeled examples in $\mathbb{R}^n \times \{-1,1\}$, even if there exists a degree-$d$ PTF that is consistent with a $1 - \epsilon$ fraction of the examples.

•  It is NP-hard to find a degree-2 PTF that is consistent with a $(\frac{1}{2} + \epsilon)$ fraction of a given set of labeled examples in $\mathbb{R}^n \times \{-1,1\}$, even if there exists a halfspace (degree-1 PTF) that is consistent with a $1 - \epsilon$ fraction of the examples.

These results immediately imply the following hardness of learning results: (i) Assuming the Unique Games Conjecture, there is no better-than-trivial proper learning algorithm that agnostically learns degree-$d$ PTFs under arbitrary distributions; (ii) There is no better-than-trivial learning algorithm that outputs degree-2 PTFs and agnostically learns halfspaces (i.e. degree-1 PTFs) under arbitrary distributions.
1  Introduction

A polynomial threshold function (PTF) of degree $d$ is a function $f : \mathbb{R}^n \to \{-1,+1\}$ of the form $f(x) = \mathrm{sign}(p(x))$, where

$$p(x) = \sum_{\text{multiset } S \subseteq [n],\ |S| \le d} c_S \prod_{i \in S} x_i$$

is a degree-$d$ multivariate polynomial with real coefficients. Degree-1 PTFs are commonly known as halfspaces or linear threshold functions, and have been intensively studied for decades in fields as diverse as theoretical neuroscience, social choice theory and Boolean circuit complexity.
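
As a minimal illustration of this definition (a sketch, not part of the original paper), a PTF can be stored as a map from multisets $S$ to coefficients $c_S$. The representation by sorted 0-based index tuples, the convention $\mathrm{sign}(0) = +1$, and the example coefficients are all assumptions made here for concreteness.

```python
from math import prod

def ptf(coeffs, x):
    """Evaluate sign(p(x)) for p(x) = sum_S c_S * prod_{i in S} x_i.

    coeffs maps a multiset S (a sorted tuple of 0-based indices, () for the
    constant term) to its real coefficient c_S.
    """
    p = sum(c * prod(x[i] for i in S) for S, c in coeffs.items())
    return 1 if p >= 0 else -1   # assumed convention: sign(0) = +1

# Hypothetical degree-2 example: p(x) = x_0 * x_1 - 0.5.
xor_like = {(0, 1): 1.0, (): -0.5}
print(ptf(xor_like, [1.0, 1.0]), ptf(xor_like, [1.0, -1.0]))
```

A repeated index such as `(0, 0)` encodes the monomial $x_0^2$, which is why multisets rather than sets are used.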

The last few years have witnessed a surge of research interest and results in theoretical computer science on halfspaces and low-degree PTFs, see e.g. [25, 23, 7, 8, 10, 6, 15]. One reason for this interest is the central role played by low-degree PTFs (and halfspaces in particular) in both practical and theoretical aspects of machine learning, where many learning algorithms either implicitly or explicitly use low-degree PTFs as their hypotheses. More specifically, several widely used linear separator learning algorithms such as the Perceptron algorithm and the “maximum margin” algorithm at the heart of Support Vector Machines output halfspaces as their hypotheses. These and other halfspace-based learning methods are commonly augmented in practice with the “kernel trick,” which makes it possible to efficiently run these algorithms over an expanded feature space and thus potentially learn from labeled data that is not linearly separable in $\mathbb{R}^n$. The “polynomial kernel” is a popular kernel to use in this way; when, as is usually the case, the degree parameter in the polynomial kernel is set to be a small constant, these algorithms output hypotheses that are equivalent to low-degree PTFs. Low-degree PTFs are also used as hypotheses in several important learning algorithms with a more complexity-theoretic flavor, such as the low-degree algorithm of Linial et al. [21] and its variants [12, 22], including some algorithms for distribution-specific agnostic learning [14, 20, 3, 6].

Given  the  importance  of  learning  algorithms  that 
construct  low-degree  PTF  hypotheses,  it  is  a  natural 
goal  to  study  the  limitations  of  learning  algorithms 
that  work  in  this  way.  On  the  positive  side,  it  is  well 
known  that  if  there  is  a  PTF  (of  constant  degree  d) 
that  is  consistent  with  all  the  examples  in  a  data 
set,  then  a  consistent  hypothesis  can  be  found  in 
polynomial  time  simply  by  using  linear  programming 
(with the $O(n^d)$ monomials of degree at most $d$ as the
variables  in  the  LP).  However,  the  assumption  that 
some  low-degree  PTF  correctly  labels  all  examples 
seems  quite  strong;  in  practice  data  is  often  noisy  or 
too  complex  to  be  consistent  with  a  simple  concept. 
Thus  we  are  led  to  ask:  if  no  low-degree  PTF  classifies 
an  entire  data  set  perfectly,  to  what  extent  can  the 
data be learned using low-degree PTF hypotheses?
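
The realizable case can be sketched in a few lines: expand each example into its monomials of degree at most $d$ and look for a linear separator in the expanded space. The sketch below is only an illustration. To stay within the Python standard library it swaps the linear program mentioned above for the Perceptron update rule, which finds a consistent hypothesis when the expanded data is strictly separable but, unlike linear programming, carries no polynomial-time guarantee. The function names and the `max_epochs` cutoff are hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb, prod

def monomials(n, d):
    """All multisets S of {0..n-1} with |S| <= d, as sorted index tuples."""
    return [S for k in range(d + 1)
            for S in combinations_with_replacement(range(n), k)]

def expand(x, d):
    """Map x to the vector of all its monomials of degree <= d."""
    return [prod(x[i] for i in S) for S in monomials(len(x), d)]

def perceptron(examples, d, max_epochs=1000):
    """Search for a degree-d PTF consistent with all labeled examples.

    Returns the coefficient vector (ordered as in monomials()), or None if
    max_epochs passes over the data did not reach zero mistakes.
    """
    feats = [(expand(x, d), y) for x, y in examples]
    w = [0.0] * len(feats[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for phi, y in feats:
            if y * sum(wi * pi for wi, pi in zip(w, phi)) <= 0:
                w = [wi + y * pi for wi, pi in zip(w, phi)]
                mistakes += 1
        if mistakes == 0:
            return w
    return None

# XOR-style data: not linearly separable, but consistent with sign(x_0 * x_1).
data = [([1, 1], 1), ([1, -1], -1), ([-1, 1], -1), ([-1, -1], 1)]
w = perceptron(data, d=2)
print(len(monomials(2, 2)), comb(2 + 2, 2))  # number of monomials vs. C(n+d, d)
```

The same data has no consistent degree-1 hypothesis, so `perceptron(data, d=1)` runs out of epochs and returns `None`.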

In this paper, we address this question under the agnostic learning framework [11, 16]. Roughly speaking, a function class $\mathcal{C}$ is agnostically learnable if we can efficiently find a hypothesis that has accuracy arbitrarily close to the accuracy of the best hypothesis in $\mathcal{C}$. Uniform convergence results [11] imply that learnability in this model is essentially equivalent to the ability to come up with a hypothesis that correctly classifies almost as many examples as the optimal hypothesis in the function class. This problem is sometimes referred to as a “Maximum Agreement” problem for $\mathcal{C}$. As we now describe, this problem has previously been well studied for the class $\mathcal{C}$ of halfspaces.

Related Work. The Maximum Agreement problem for halfspaces over $\mathbb{R}^n$ was shown to be NP-hard to approximate within some constant factor in [1, 2]. The inapproximability factor was improved to $84/85 + \epsilon$ in [4], which showed that this hardness result applies even if the examples must lie on the $n$-dimensional Boolean hypercube. Finally, a tight inapproximability result was established independently in [10] and [7]; these works showed that for any constant $\epsilon > 0$, it is NP-hard to find a halfspace consistent with a $(\frac{1}{2} + \epsilon)$ fraction of the examples even if there exists a halfspace consistent with a $(1 - \epsilon)$ fraction of the examples. (It is trivial to find a halfspace consistent with half of the examples since either the constant-0 or constant-1 halfspace will suffice.) The reduction in [7] produced examples with real-valued coordinates, whereas the proof in [10] yielded examples that lie on the Boolean hypercube.

Thanks to these results the Maximum Agreement problem is well-understood for halfspaces, but the situation is very different for low-degree PTFs. Even for degree-2 PTFs no hardness results were previously known, and recent work [6] has in fact given efficient agnostic learning algorithms for low-degree PTFs under specific distributions on examples such as Gaussian distributions or the uniform distribution over $\{-1,1\}^n$ (though it should be noted that these distribution-specific agnostic learning algorithms for degree-$d$ PTFs are not proper: they output PTF hypotheses of degree $\gg d$). In this paper we make the first progress on this problem, by establishing strong hardness of approximation results for the Maximum Agreement problem for low-degree PTFs. Our results directly imply corresponding hardness results for agnostically learning low-degree PTFs under arbitrary distributions; we present all these results below.

Main Results. Our main results are the following two theorems. The first result establishes UGC-hardness of finding a nontrivial degree-$d$ PTF hypothesis even if some degree-$d$ PTF has almost perfect accuracy:

Theorem 1.1. Fix $\epsilon > 0$, $d \ge 1$. Assuming the Unique Games Conjecture, no polynomial-time algorithm can find a degree-$d$ PTF that is consistent with a $(\frac{1}{2} + \epsilon)$ fraction of a given set of labeled examples in $\mathbb{R}^n \times \{-1,1\}$, even if there exists a degree-$d$ PTF that is consistent with a $1 - \epsilon$ fraction of the examples.

The  second  result  shows  that  it  is  NP-hard  to 
find  a  degree-2  PTF  hypothesis  that  has  nontrivial 
accuracy  even  if  some  halfspace  has  almost  perfect 
accuracy: 

Theorem 1.2. Fix $\epsilon > 0$. It is NP-hard to find a degree-2 PTF that is consistent with a $(\frac{1}{2} + \epsilon)$ fraction of a given set of labeled examples in $\mathbb{R}^n \times \{-1,1\}$, even if there exists a halfspace (degree-1 PTF) that is consistent with a $1 - \epsilon$ fraction of the examples.

As  noted  above,  both  problems  become  easy 
(using  linear  programming)  if  the  best  hypothesis  is 
assumed  to  have  perfect  agreement  with  the  data  set 
rather than agreement $1 - \epsilon$, and it is trivial to find
a  (constant-valued)  hypothesis  with  agreement  rate 
1/2  for  any  data  set.  Thus  the  parameters  in  both 
hardness  results  are  essentially  the  best  possible. 

These results can be rephrased as hardness of agnostic learning results in the following way: (i) Assuming the Unique Games Conjecture, even if there exists a degree-$d$ PTF that is consistent with a $1 - \epsilon$ fraction of the examples, there is no efficient proper agnostic learning algorithm that can output a degree-$d$ PTF correctly labeling more than a $\frac{1}{2} + \epsilon$ fraction of the examples; (ii) Assuming $\mathrm{P} \neq \mathrm{NP}$, even if there exists a halfspace that is consistent with a $1 - \epsilon$ fraction of the examples, there is no efficient agnostic learning algorithm that can find a degree-2 PTF correctly labeling more than a $\frac{1}{2} + \epsilon$ fraction of the examples.

Organization. In Section 2 we present the complexity-theoretic basis (the Unique Games Conjecture and the NP-hardness of Label Cover) of our hardness results. In Section 3 we sketch a new proof of the hardness of the Maximum Agreement problem for halfspaces, and give an overview of how the proofs of Theorems 1.1 and 1.2 build on this basic argument. In Sections 4 and 5 we prove Theorems 1.1 and 1.2.

Notational Preliminaries: For $n \in \mathbb{Z}^+$ we denote by $[n]$ the set $\{1, \ldots, n\}$. For $i, j \in \mathbb{Z}^+$, $i < j$, we denote by $[i,j]$ the set $\{i, i+1, \ldots, j\}$. We write $\{j : m\}$ to denote the multiset that contains $m$ copies of the element $j$. We write $\chi_S(x)$ to denote $\prod_{i \in S} x_i$, the monomial corresponding to the multiset $S$.

2  Complexity-theoretic  preliminaries 

We recall the Unique Games problem that was introduced by Khot [17]:

Definition 2.1. A Unique Games instance $\mathcal{L}$ is defined by a tuple $(U, V, E, k, \Pi)$. Here $U$ and $V$ are the two vertex sets of a regular bipartite graph and $E$ is the set of edges between $U$ and $V$. $\Pi$ is a collection of bijections, one for each edge: $\Pi = \{\pi_e : [k] \to [k]\}_{e \in E}$ where each $\pi_e$ is a bijection on $[k]$. A labeling $\ell$ is a function that maps $U \to [k]$ and $V \to [k]$. We say that an edge $e = (u, v)$ is satisfied by labeling $\ell$ if $\pi_e(\ell(v)) = \ell(u)$. We define the value of the Unique Games instance $\mathcal{L}$, denoted $\mathrm{Opt}(\mathcal{L})$, to be the maximum fraction of edges that can be satisfied by any labeling.
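
To make the definition concrete, the value of a very small instance can be computed by exhaustive search. This is only an illustrative sketch: the encoding of each bijection as a tuple over 0-based labels, the vertex names, and the toy instance are assumptions, and the search takes time exponential in the number of vertices.

```python
from itertools import product

def ug_value(U, V, edges, k):
    """Brute-force Opt of a small Unique Games instance.

    edges is a list of (u, v, pi), where pi is a tuple encoding the bijection
    pi_e on labels {0, ..., k-1}; an edge is satisfied when
    pi[label[v]] == label[u].
    """
    vertices = list(U) + list(V)
    best = 0.0
    for assignment in product(range(k), repeat=len(vertices)):
        label = dict(zip(vertices, assignment))
        sat = sum(1 for u, v, pi in edges if pi[label[v]] == label[u])
        best = max(best, sat / len(edges))
    return best

# Hypothetical toy instance on the complete bipartite graph K_{2,2}, k = 2.
ident, swap = (0, 1), (1, 0)
consistent = [("u1", "v1", ident), ("u1", "v2", ident),
              ("u2", "v1", ident), ("u2", "v2", ident)]
twisted = consistent[:3] + [("u2", "v2", swap)]
print(ug_value(["u1", "u2"], ["v1", "v2"], consistent, 2),
      ug_value(["u1", "u2"], ["v1", "v2"], twisted, 2))
```

In the second instance the four constraints around the cycle contradict each other, so at most three of the four edges can be satisfied at once.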

The Unique Games Conjecture (UGC) was proposed by Khot in [17] and has led to many improved hardness of approximation results over those which can be achieved assuming only $\mathrm{P} \neq \mathrm{NP}$:

Conjecture 2.2 (Unique Games Conjecture).¹ Fix any constant $\eta > 0$. For sufficiently large $k = k(\eta)$, given a Unique Games instance $\mathcal{L} = (U, V, E, k, \Pi)$ that is guaranteed to satisfy one of the following two conditions, it is NP-hard to determine which condition is satisfied: $\mathrm{Opt}(\mathcal{L}) \ge 1 - \eta$, or
Opt(£)  < 

¹ We use the statement from [18] which is equivalent to the original Unique Games Conjecture.


Our first hardness result, Theorem 1.1, is proved under the Unique Games Conjecture. Our second hardness result, Theorem 1.2, uses only the assumption that $\mathrm{P} \neq \mathrm{NP}$; the proof employs a reduction from the Label Cover problem, defined below.

Definition 2.3. A Label Cover instance $\mathcal{L}$ is defined by a tuple $(U, V, E, k, m, \Pi)$. Here $U$ and $V$ are the two vertex sets of a regular bipartite graph and $E$ is the set of edges between $U$ and $V$. $\Pi$ is a collection of “projections”, one for each edge: $\Pi = \{\pi_e : [m] \to [k]\}_{e \in E}$, and $m, k$ are positive integers. A labeling $\ell$ is a function that maps $U \to [k]$ and $V \to [m]$. We say that an edge $e = (u, v)$ is satisfied by labeling $\ell$ if $\pi_e(\ell(v)) = \ell(u)$. We define the value of the Label Cover instance, denoted $\mathrm{Opt}(\mathcal{L})$, to be the maximum fraction of edges that can be satisfied by any labeling.

We use the following theorem [24] which establishes NP-hardness of a “gap” version of Label Cover:

Theorem 2.4. Fix any constant $\eta > 0$. Given a Label Cover instance $\mathcal{L} = (U, V, E, k, m, \Pi)$ that is guaranteed to satisfy one of the following two conditions, it is NP-hard to determine which condition is satisfied: $\mathrm{Opt}(\mathcal{L}) = 1$, or $\mathrm{Opt}(\mathcal{L}) < 1/m^{\eta}$.

3  Overview  of  our  arguments 

To illustrate the structure of our arguments, let us begin by sketching a proof of the following hardness result for the Maximum Agreement problem for halfspaces:

Proposition 3.1. Assuming the Unique Games Conjecture, no polynomial-time algorithm can find a halfspace (degree-1 PTF) that is consistent with a $(\frac{1}{2} + \epsilon)$ fraction of a given set of labeled examples in $\mathbb{R}^n \times \{-1,1\}$, even if there exists a halfspace that is consistent with a $1 - \epsilon$ fraction of the examples.

As mentioned above, the same hardness result (based only on the assumption that $\mathrm{P} \neq \mathrm{NP}$) has already been established in [7, 10]; indeed, we do not claim Proposition 3.1 as a new result. However, the argument sketched below is different from (and, we believe, simpler than) the other proofs; it helps to illustrate how we eventually achieve the more general hardness results, Theorems 1.1 and 1.2.

Proof Sketch for Proposition 3.1: We describe a reduction that maps any instance $\mathcal{L}$ of Unique Games to a set of labeled examples with the following guarantee: if $\mathrm{Opt}(\mathcal{L})$ is very close to 1 then there is a halfspace that agrees with a $1 - \epsilon$ fraction of the examples, while if $\mathrm{Opt}(\mathcal{L})$ is very close to 0 then no halfspace agrees with more than a $\frac{1}{2} + \epsilon$ fraction of the examples. A reduction of this sort directly yields Proposition 3.1.

Let $\mathcal{L} = (U, V, E, k, \Pi)$ be a Unique Games instance. Each example generated by the reduction has $(|U| + |V|)k$ coordinates, i.e. the examples lie in $\mathbb{R}^{(|U|+|V|)k}$. The coordinates should be viewed as being grouped together in the following way: there is a block of $k$ coordinates for each vertex $w$ in $U \cup V$. We index the coordinates of $x \in \mathbb{R}^{(|U|+|V|)k}$ as $x = (x_w^{(i)})$ where $w \in U \cup V$ and $i \in [k]$.

Given any function $f : \mathbb{R}^{(|U|+|V|)k} \to \{-1,1\}$ and vertex $w \in U \cup V$, we write $f_w$ to denote the restriction of $f$ to the $k$ coordinates $(x_w^{(i)})_{i \in [k]}$ that is obtained by setting all other coordinates $(x_{w'}^{(i)})_{w' \neq w}$ to 0. Similarly, for $e = \{u, v\}$ an edge in $U \times V$, we write $f_e$ for the restriction that fixes all coordinates $(x_{w'}^{(i)})_{w' \notin e}$ to 0 and leaves the $2k$ coordinates $x_u^{(i)}, x_v^{(i)}$ unrestricted.

For every labeling $\ell : U \cup V \to [k]$ of the instance, there is a corresponding halfspace over $\mathbb{R}^{(|U|+|V|)k}$:

$$\mathrm{sign}\Bigl(\sum_{u \in U} x_u^{(\ell(u))} - \sum_{v \in V} x_v^{(\ell(v))}\Bigr).$$

Given a Unique Games instance $\mathcal{L}$, the reduction constructs a distribution $\mathcal{D}$ over labeled examples such that if $\mathrm{Opt}(\mathcal{L})$ is almost 1 then the above halfspace has very high accuracy w.r.t. $\mathcal{D}$, and any halfspace that has accuracy at least $\frac{1}{2} + \epsilon$ yields a labeling that satisfies a constant fraction of edges in $\mathcal{L}$. A draw from $\mathcal{D}$ is obtained by first selecting a uniform random edge $e = \{u, v\}$ from $E$, and then making a draw from $\mathcal{D}_e$, where $\mathcal{D}_e$ is a distribution over labeled examples that we describe below.

Fix an edge $e = (u, v)$. For the sake of exposition, let us assume the mapping $\pi_e \in \Pi$ associated with $e$ is the identity permutation, i.e. $\pi_e(i) = i$ for every $i \in [k]$. The distribution $\mathcal{D}_e$ will have the following properties:

(i) For every $(y, b)$ in the support of $\mathcal{D}_e$, all coordinates $y_w^{(i)}$ for every vertex $w \notin e$ are zero.

(ii) For every label $i \in [k]$, the halfspace $\mathrm{sign}(x_u^{(i)} - x_v^{(i)})$ has accuracy $1 - \epsilon$ w.r.t. $\mathcal{D}_e$.

(iii) If $\mathrm{sign}(f_e)$ is a halfspace that has accuracy at least $\frac{1}{2} + \epsilon$ w.r.t. $\mathcal{D}_e$, then the functions $f_u, f_v$ can each be individually “decoded” to a “small” (constant-sized) set $S_u, S_v \subseteq [k]$ of labels such that $S_u \cap S_v \neq \emptyset$ (so a labeling that satisfies a nonnegligible fraction of edges in expectation can be obtained simply by choosing a random label from $S_w$ for each $w$; such a random choice will satisfy each edge's bijection with constant probability, so in expectation will satisfy a constant fraction of constraints).


Let us explain item (iii) in more detail. Since the distribution $\mathcal{D}_e$ is supported on vectors $y$ that have the $(y_w^{(i)})_{w \notin e}$ coordinates all 0, the distribution $\mathcal{D}_e$ only “looks at” the restriction $f_e$ of $f$, which is a halfspace on $\mathbb{R}^{2k}$. Thus achieving (iii) can be viewed as solving a kind of property testing problem which may loosely be described as “Matching dictator testing for halfspaces.” To be more precise, what is required is a distribution $\mathcal{D}_e$ over $2k$-dimensional labeled examples and a “decoding” algorithm $A$ which takes as input a $k$-variable halfspace and outputs a set of coordinates. Together these must have the following properties:


•  (Completeness) If $f_e(x) = x_u^{(i)} - x_v^{(i)}$ then $\mathrm{sign}(f_e(y)) = b$ with probability $1 - \epsilon$ for $(y, b) \sim \mathcal{D}_e$.


•  (Soundness) If $f_e$ is such that $\mathrm{sign}(f_e(y)) = b$ with probability at least $1/2 + \epsilon$ for $(y, b)$ drawn from $\mathcal{D}_e$, then the output sets $A(f_u)$, $A(f_v)$ of the decoding algorithm (when it is run on $f_u$ and $f_v$ respectively) are two small sets that intersect each other.


Testing problems of this general form are often referred to as Dictatorship Testing; the design and analysis of such tests is a recurring theme in hardness of approximation.

We give a “matching dictator test for halfspaces” below. More precisely, in the following figure we describe the distribution $\mathcal{D}_e$ over examples (the decoding algorithm $A$ is described later).
$T_1$: Matching Dictatorship Test for Halfspaces

Input: A halfspace $f_e : \mathbb{R}^{2k} \to \mathbb{R}$.

Set  ^=^,5“  1/2*. 

1.  Generate independent 0/1 bits $a_1, a_2, \ldots, a_k$ each with $\mathbf{E}[a_i] = \epsilon$. Generate $2k$ independent $N(0,1)$ Gaussian random variables: $h_1, h_2, \ldots, h_k, g_1, g_2, \ldots, g_k$. Generate a random bit $b \in \{-1, 1\}$.

2.  Set $r = (a_1 h_1 + g_1, \ldots, a_k h_k + g_k, g_1, \ldots, g_k)$ and $\omega = (1, \ldots, 1, 0, \ldots, 0) \in \mathbb{R}^{2k}$ to be the vector whose first $k$ coordinates are 1 and last $k$ coordinates are 0.

3.  Set $y = r + b\delta\omega$. The result of a draw from $\mathcal{D}_e$ is the labeled example $(y, b)$.

The test checks whether $\mathrm{sign}(f_e(y))$ equals $b$.
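
The three steps of the test translate directly into a sampler. The following is a minimal sketch under stated assumptions: `eps` and `delta` are passed in as plain parameters rather than fixed as in the test, coordinates are 0-based, $\mathrm{sign}(0)$ is taken to be $+1$, and the function names are hypothetical. Ordinary floating point is used, so `delta` must be much larger than machine precision for the shift $b\delta$ to be visible.

```python
import random

def sample_T1(k, eps, delta, rng=random):
    """One draw (y, b) from the edge distribution, plus the internal bits a.

    y has 2k coordinates: the first k belong to u, the last k to v.
    """
    a = [1 if rng.random() < eps else 0 for _ in range(k)]
    h = [rng.gauss(0.0, 1.0) for _ in range(k)]
    g = [rng.gauss(0.0, 1.0) for _ in range(k)]
    b = rng.choice([-1, 1])
    r = [a[i] * h[i] + g[i] for i in range(k)] + g   # step 2
    omega = [1.0] * k + [0.0] * k
    y = [ri + b * delta * wi for ri, wi in zip(r, omega)]  # step 3
    return y, b, a

def passes(f, y, b):
    """The test accepts iff sign(f(y)) == b, with sign(0) taken as +1."""
    return (1 if f(y) >= 0 else -1) == b

# Matching dictator on label 0: f(y) = y_u^(0) - y_v^(0).
k = 5
y, b, a = sample_T1(k, eps=0.2, delta=0.01, rng=random.Random(0))
print(a[0], passes(lambda z: z[0] - z[k], y, b))
```

Whenever the bit `a[0]` is 0, the Gaussian `g[0]` cancels in `y[0] - y[k]` and only the shift $b\delta$ remains, so the matching dictator is accepted.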

It is useful to view the test in the following light: Let us write $f_e(x)$ as $\theta + \sum_{i=1}^k w_u^{(i)} x_u^{(i)} + \sum_{i=1}^k w_v^{(i)} x_v^{(i)}$, and let us suppose that $\sum_{i=1}^k |w_u^{(i)}| = 1$ (as long as some $w_u^{(i)}$ is nonzero this is easily achieved by rescaling; for this intuitive sketch we ignore the case that all $w_u^{(i)}$ are 0, which is not difficult to handle). Then we have $f_e(y) = f_e(r) + b\delta$, and we may view the test as randomly choosing one of the two inequalities $f_e(r) - \delta < 0$, $f_e(r) + \delta \ge 0$ and checking that it holds. Since at least one of these inequalities must hold for every $f_e$, the probability that $f_e$ passes the test is $\frac{1}{2} + \frac{1}{2}\Pr_r[f_e(r) \in [-\delta, \delta)]$. This interpretation will be useful both for analyzing completeness and soundness of the test.
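
For completeness, the passing probability can be derived in two lines. The derivation takes the identity $f_e(y) = f_e(r) + b\delta$ from the sketch as given, with $b$ uniform in $\{-1,1\}$ and independent of $r$; the event names $A$ and $B$ are introduced here only for the calculation.

```latex
% A = { f_e(r) + \delta \ge 0 } is the accepting event when b = +1,
% B = { f_e(r) - \delta < 0 }  is the accepting event when b = -1.
% Since \delta > 0, every r lies in A \cup B, so Pr[A] + Pr[B] = 1 + Pr[A \cap B].
\begin{align*}
\Pr[f_e \text{ passes}]
  &= \tfrac{1}{2}\Pr_r[A] + \tfrac{1}{2}\Pr_r[B] \\
  &= \tfrac{1}{2}\bigl(1 + \Pr_r[A \cap B]\bigr)
   = \tfrac{1}{2} + \tfrac{1}{2}\Pr_r\bigl[-\delta \le f_e(r) < \delta\bigr].
\end{align*}
```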

For completeness, it is easy to see that the “matching dictator” function $f_e(x) = x_u^{(i)} - x_v^{(i)}$ has $f_e(r) = a_i h_i$ and thus $\Pr[f_e(r) = 0] = 1 - \epsilon$, so this function indeed passes the test with probability $1 - \epsilon$.

The soundness analysis, which we now sketch, is more involved. Let $f_e$ be such that $\Pr_r[f_e(r) \in [-\delta, \delta)] \ge 2\epsilon$. Since $f_e(r) = \theta + \sum_i (w_u^{(i)} + w_v^{(i)}) g_i + \sum_i w_u^{(i)} a_i h_i$ and $g_i, h_i$ are i.i.d. Gaussians, conditioned on a given outcome of the $a_i$-bits the value $f_e(r)$ follows the Gaussian distribution with mean $\theta$ and variance $\sum_i (w_u^{(i)} + w_v^{(i)})^2 + \sum_i (a_i w_u^{(i)})^2$. Now recall that an $N(0, \sigma)$ Gaussian random variable lands in the interval $[-t, t]$ with probability at most $O(t/\sigma)$ (and shifting the mean away from 0 can only decrease this probability). So any $a$-vector for which the variance $\sum_i (w_u^{(i)} + w_v^{(i)})^2 + \sum_i (a_i w_u^{(i)})^2$ is not “tiny” can contribute only a negligible amount to the overall probability that $f_e(r)$ lies in $[-\delta, \delta)$ (recall that $\delta$ is extremely tiny).
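
The $O(t/\sigma)$ anti-concentration bound recalled above can be checked numerically. In this sketch, which reads $\sigma$ as the standard deviation, the exact probability comes from the error function, and the bound is the interval length $2t$ times the peak density $1/(\sigma\sqrt{2\pi})$. The function names are hypothetical.

```python
from math import erf, pi, sqrt

def prob_in_interval(t, sigma):
    """Pr[|X| <= t] for a centered Gaussian X with standard deviation sigma."""
    return erf(t / (sigma * sqrt(2.0)))

def density_bound(t, sigma):
    """Interval length 2t times the peak density 1 / (sigma * sqrt(2*pi))."""
    return 2.0 * t / (sigma * sqrt(2.0 * pi))

for t, sigma in [(1e-6, 1.0), (1e-3, 0.1), (0.5, 2.0)]:
    print(t, sigma, prob_in_interval(t, sigma), density_bound(t, sigma))
```

For $t$ much smaller than $\sigma$ the two quantities nearly coincide, which is the regime used in the argument, where $t = \delta$ is extremely small.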


Since by assumption $\Pr_r[f_e(r) \in [-\delta, \delta)]$ is non-negligible (at least $2\epsilon$), there must be a non-negligible fraction of $a$-vector outcomes that make the variance $\sum_i (w_u^{(i)} + w_v^{(i)})^2 + \sum_i (a_i w_u^{(i)})^2$ be “tiny.” This implies that there must be only a “few” coordinates $w_u^{(i)}$ for which $|w_u^{(i)}|$ is not tiny (for if there were many non-tiny $w_u^{(i)}$ coordinates, then $\sum_i (a_i w_u^{(i)})^2$ would be non-tiny with probability nearly 1 over the choice of the $a$-vector). Moreover, $w_u^{(i)} + w_v^{(i)}$ must be $\approx 0$ for each $i$, so for each $i$ the magnitudes $|w_u^{(i)}|$ and $|w_v^{(i)}|$ must be nearly equal; and in particular, each $|w_u^{(i)}|$ is large if and only if $|w_v^{(i)}|$ is large. Finally, since $\sum_i |w_u^{(i)}|$ equals 1, some $|w_u^{(i)}|$'s must be large (at least $1/k$).

With these facts in place, the appropriate decoding algorithm $A$ is rather obvious: given $f_u = \theta + \sum_{i=1}^k w_u^{(i)} x_u^{(i)}$ as input, $A$ outputs the set $S_u$ of those coordinates $i$ for which $|w_u^{(i)}|$ is large (and similarly for $f_v$). This set cannot be too large since $\sum_{i=1}^k |w_u^{(i)}|$ equals 1. Now a labeling that satisfies edge $e$ with non-negligible probability can be obtained by outputting a random element from $S_u$ and a random element from $S_v$; since these sets are small there is a non-negligible probability that the labels will match as required. This concludes the proof sketch of Proposition 3.1. □
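
The decoding step can be sketched as a simple thresholding routine. The cutoff value and the example weights below are hypothetical (the sketch above only says “large”); the point is that when the absolute weights sum to 1, at most `1 / threshold` coordinates can reach the cutoff.

```python
def decode(weights, threshold):
    """Return the labels i whose weight |w_i| is at least `threshold`."""
    return {i for i, w in enumerate(weights) if abs(w) >= threshold}

w_u = [0.55, -0.02, 0.40, 0.03]     # hypothetical weights, sum of |w_i| is 1
w_v = [-0.56, 0.01, -0.41, -0.02]   # nearly matching: w_u[i] + w_v[i] is about 0
S_u, S_v = decode(w_u, 0.25), decode(w_v, 0.25)
print(S_u, S_v, S_u & S_v)
```

Picking one label uniformly from each decoded set then agrees on a common label with probability at least $1/(|S_u| \cdot |S_v|)$ whenever the sets intersect.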

Overview of the proofs of Theorems 1.1 and 1.2. For Theorem 1.1 (hardness of properly learning degree-$d$ PTFs), we must deal with the additional complication of handling the cross-terms such as $x_u^{(i)} x_v^{(j)}$ between $u$-variables and $v$-variables that may be present in degree-$d$ PTFs. As an example of how such cross-terms can cause problems, observe that the degree-3 polynomial $f_e = (x_u^{(1)} - x_v^{(1)}) (\sum_{i=1}^k x_u^{(i)})^2$ would pass the test $T_1$ with high probability, but this polynomial has $f_v = 0$ so there is no way to successfully “decode” a good label for $v$. To get around this, we modify the test $T_1$ to set $y = (a_1 h_1 + g_1^d + b\delta, a_2 h_2 + g_2^d + b\delta, \ldots, a_k h_k + g_k^d + b\delta, g_1, \ldots, g_k)$; intuitively this modified test checks whether the polynomial $f_e$ is of the form $x_u^{(i)} - (x_v^{(i)})^d$. The bulk of our work is in analyzing the soundness of this test; we show that any polynomial $f_e$ that passes the modified test with probability significantly better than 1/2 must have almost no coefficient weight on cross-terms, and that in fact the restricted polynomials $f_u, f_v$ can each be decoded to a small set in such a way that there is a matching pair as desired. We give a complete description and analysis of our Dictator Test and prove Theorem 1.1 in Section 4.
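
The cross-term obstacle can be seen on a small example. The sketch below takes the degree-3 polynomial to be $(x_u^{(1)} - x_v^{(1)})(\sum_i x_u^{(i)})^2$, which is an assumed reading of the example in the text, and checks that its restriction to the $v$-block vanishes identically. Indices are 0-based and the function names are hypothetical.

```python
def f_e(xu, xv):
    """(x_u^(1) - x_v^(1)) * (sum_i x_u^(i))^2, a degree-3 polynomial."""
    return (xu[0] - xv[0]) * sum(xu) ** 2

def f_v(xv):
    """Restriction of f_e to the v-block: all u-coordinates are set to 0."""
    return f_e([0.0] * len(xv), xv)

print(f_v([3.0, -1.0, 2.0]), f_e([2.0, 1.0], [1.0, 5.0]))
```

Since the squared factor is nonnegative, $f_e$ has the same sign as the matching dictator $x_u^{(1)} - x_v^{(1)}$ wherever that factor is nonzero, yet `f_v` carries no information about which label to decode for $v$.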

For Theorem 1.2, a first observation is that the test $T_1$ in fact already has soundness $3/4 + \epsilon$ for degree-2 PTFs. To see this, we begin by writing the degree-2 polynomial $f_e(x)$ as $\theta + f_1(x) + f_2(x)$ where $f_1(x)$ is the linear (degree 1) part and $f_2(x)$ is the quadratic (degree 2) part (note that $f_1$ is an odd function and $f_2$ is an even function). We next observe that since any vector $r$ is generated with the same probability as $-r$, the test may be viewed as randomly selecting one of the following 4 inequalities to verify: $f_e(r + \delta\omega) \ge 0$, $f_e(r - \delta\omega) < 0$, $f_e(-r + \delta\omega) \ge 0$, $f_e(-r - \delta\omega) < 0$. If all four inequalities hold, then combining $f_e(r + \delta\omega) \ge 0$ with $f_e(-r - \delta\omega) < 0$ we get that $f_1(r + \delta\omega) > 0$, and combining $f_e(r - \delta\omega) < 0$ with $f_e(-r + \delta\omega) \ge 0$ we get $f_1(r - \delta\omega) < 0$. Consequently, if a degree-2 polynomial $f_e$ passes the test with probability $3/4 + \epsilon$, then by an averaging argument, for at least an $\epsilon$ fraction of the $r$-outcomes all four of the inequalities must hold. This implies that for an $\epsilon$ fraction of the $r$'s we must have $f_1(r + \delta\omega) > 0$ and $f_1(r - \delta\omega) < 0$, and so the degree-1 PTF $f_1$ must pass the Dictator Test $T_1$ with probability at least $1/2 + \epsilon$. This essentially reduces to the problem of testing degree-1 PTFs, whose analysis is sketched above.
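
The step that combines pairs of inequalities rests on a one-line identity for the odd part of $f_e$. The derivation below only spells out that step; the shorthand $z$ is introduced here for the calculation.

```latex
% Since f_1 is odd and f_2 is even, for every point z:
\begin{align*}
f_e(z) - f_e(-z)
  &= \bigl(\theta + f_1(z) + f_2(z)\bigr) - \bigl(\theta - f_1(z) + f_2(z)\bigr)
   = 2 f_1(z). \\
\intertext{Taking $z = r + \delta\omega$ and then $z = r - \delta\omega$:}
f_e(r + \delta\omega) \ge 0 > f_e(-r - \delta\omega)
  &\;\Longrightarrow\; f_1(r + \delta\omega) > 0, \\
f_e(r - \delta\omega) < 0 \le f_e(-r + \delta\omega)
  &\;\Longrightarrow\; f_1(r - \delta\omega) < 0.
\end{align*}
```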

To get the soundness down to 1/2 more work has to be done. Roughly speaking, we modify the test by checking that $\mathrm{sign}(f(k_1 r + k_2 \delta\omega)) = \mathrm{sign}(k_2)$ for $k_1, k_2$ generated from a carefully constructed distribution in which $k_1, k_2$ can assume many different possible orders of magnitude. Using these many different possibilities for the magnitudes of $k_1, k_2$, a careful analysis (based on combining inequalities in a way that is similar to the previous paragraph, though significantly more complicated) shows that if a polynomial passes the test with probability $1/2 + \epsilon$ then it can be “decoded” to a small set of coordinates. In addition to this modification, to avoid using the Unique Games Conjecture we employ the “folding trick” that is proposed in [9, 19] to ensure consistency across different vertices. One benefit of using this trick is that with it, we only need to design a test on one vertex instead of an edge.² The complete proof of Theorem 1.2 appears in Section 5.

² The reason that we cannot use “folding” for our first result on low-degree PTFs, roughly speaking, is that such a folding does not seem able to handle cross-terms of degree greater than 2.

4  Hardness of proper learning noisy degree-$d$ PTFs: Proof of Theorem 1.1

4.1  Dictator Test  Let $f : \mathbb{R}^{2n} \to \mathbb{R}$ be a $2n$-variable degree-$d$ polynomial over the reals. The key gadget in our UG-hardness reduction is a dictator test of whether $f$ is of the form $\mathrm{sign}(x_i - x_{n+i}^d)$ for some $i \in [n]$. More concretely, our dictator test queries the value of $f$ on a single point $y \in \mathbb{R}^{2n}$ and decides to accept or reject based on the value $\mathrm{sign}(f(y))$.


$T_d$: Matching Dictator Test for degree-$d$ PTFs

Input: A degree-$d$ real polynomial $f : \mathbb{R}^{2n} \to \mathbb{R}$.

Set $\beta := 1/\log n$ and $\delta := 2^{-n^2}$.

1.  Generate $n$ i.i.d. bits $a_i \in \{0, 1\}$ with $\Pr[a_i = 1] = \beta$, $i \in [n]$. Generate $2n$ i.i.d. $N(0,1)$ Gaussians $\{h_i, g_i\}_{i=1}^n$. Generate a uniform random bit $b \in \{-1, 1\}$.

2.  Set $y = (y_i)_{i=1}^{2n}$ where $y_i = a_i h_i + g_i^d + b\delta$ and $y_{n+i} = g_i$, $i \in [n]$.

3.  Accept iff $\mathrm{sign}(f(y)) = b$.

We  can  now  state  and  prove  the  properties  of  our 
test.  The  completeness  is  straightforward. 

Lemma 4.1 (Completeness). The polynomial $f(x) = x_i - x_{n+i}^d$ passes the test with probability at least $1 - \beta$.

Proof. Note that $f(y) = a_i h_i + b\delta$. Hence if $a_i = 0$ we have $\mathrm{sign}(f(y)) = b$ and this happens with probability $1 - \beta$. □
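
The cancellation in this proof can be checked with exact arithmetic. The shift $\delta = 2^{-n^2}$ is far below floating-point resolution for even moderate $n$, so the sketch below converts the sampled Gaussians to exact rationals with `fractions.Fraction`. The function names, the 0-based indexing, and the choice to pass `beta` as a parameter are assumptions made for illustration.

```python
import random
from fractions import Fraction

def sample_Td(n, d, beta, rng=random):
    """One exact draw from the test: returns (y, b, a, h) with rational y."""
    delta = Fraction(1, 2 ** (n * n))
    a = [1 if rng.random() < beta else 0 for _ in range(n)]
    h = [Fraction(rng.gauss(0.0, 1.0)) for _ in range(n)]
    g = [Fraction(rng.gauss(0.0, 1.0)) for _ in range(n)]
    b = rng.choice([-1, 1])
    y = [a[i] * h[i] + g[i] ** d + b * delta for i in range(n)] + g
    return y, b, a, h

def dictator(y, i, n, d):
    """f(y) = y_i - y_{n+i}^d for a 0-based index i."""
    return y[i] - y[n + i] ** d

n, d = 6, 3
y, b, a, h = sample_Td(n, d, beta=0.5, rng=random.Random(1))
delta = Fraction(1, 2 ** (n * n))
print(all(dictator(y, i, n, d) == a[i] * h[i] + b * delta for i in range(n)))
```

With exact rationals the identity $f(y) = a_i h_i + b\delta$ holds as an equality, so whenever $a_i = 0$ the sign of $f(y)$ is exactly $b$.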

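To state the soundness lemma we need some more notation. For a degree-$d$ polynomial $f(x) = \sum_{S \subseteq [n], |S| \le d} c_S \cdot \chi_S(x)$ we denote $\mathrm{wt}(f) = \sum_{S \neq \emptyset} |c_S|$. For $\theta > 0$, we define $I_\theta(f) := \{ i \in [n] \mid \exists S \ni i \text{ s.t. } |c_S| \ge \theta \cdot \mathrm{wt}(f) / \binom{n+d}{d} \}$. Note that for $\theta \in [0,1]$ we have that $I_\theta(f) \neq \emptyset$, since there are at most $\binom{n+d}{d}$ nonempty monomials of degree at most $d$ over $x_1, \ldots, x_n$.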
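
The set $I_\theta(f)$ is straightforward to compute from a coefficient table. The sketch below assumes that $\mathrm{wt}(f)$ is the sum of $|c_S|$ over nonempty multisets $S$ and that the normalizing count is $\binom{n+d}{d}$. Both are readings of the definitions above rather than something this sketch establishes, and the example polynomial is hypothetical.

```python
from math import comb

def wt(coeffs):
    """Assumed definition: sum of |c_S| over nonempty multisets S."""
    return sum(abs(c) for S, c in coeffs.items() if S)

def influential(coeffs, theta, n, d):
    """I_theta(f): variables in some S with |c_S| >= theta * wt(f) / C(n+d, d)."""
    cutoff = theta * wt(coeffs) / comb(n + d, d)
    return {i for S, c in coeffs.items() if abs(c) >= cutoff for i in S}

# Hypothetical polynomial on n = 4 variables (0-based), degree d = 2.
coeffs = {(): 2.0, (0,): 0.9, (1, 2): 0.099, (3,): 0.001}
print(influential(coeffs, 1.0, 4, 2), influential(coeffs, 0.5, 4, 2))
```

The constant term never contributes a variable, and the averaging argument in the text corresponds to the fact that the largest nonconstant coefficient always clears the cutoff when $\theta \le 1$ and $\mathrm{wt}(f) > 0$.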

Let $f : \mathbb{R}^{2n} \to \mathbb{R}$ be a $2n$-variable polynomial $f(x) = \sum_{S \subseteq [2n], |S| \le d} c_S \cdot \chi_S(x)$ fed as input to our test. We will consider the restrictions obtained from $f$ by setting the first (resp. second) half of the variables to 0. In particular, for $x = (x_1, \ldots, x_{2n})$ we shall denote $f_1(x_1, \ldots, x_n) = f(x_1, \ldots, x_n, 0^n)$ and $f_2(x_{n+1}, \ldots, x_{2n}) = f(0^n, x_{n+1}, \ldots, x_{2n})$.

We are now ready to state our soundness lemma. The proof of this lemma poses significant complications and constitutes the bulk of the analysis in this section.

Lemma 4.2 (Soundness). Suppose that $f(x) = \sum_{S \subseteq [2n], |S| \le d} c_S \cdot \chi_S(x)$ passes the test with probability at least $1/2 + \beta$. Then for $f_1, f_2$ as defined above, we have $|I_{0.5}(f_1)| \le 1/\beta^2$, $|I_1(f_2)| \le 1/\beta^2$. In addition, every $i \in [n]$ such that $n + i \in I_1(f_2)$ also satisfies $i \in I_{0.5}(f_1)$.

Proof. We can assume that $\mathrm{wt}(f) > 0$, since otherwise $f$ is a constant function, hence passes the test with probability exactly $\frac{1}{2}$. Since our test is invariant under scaling, we can further assume that $\mathrm{wt}(f) = 1$. Let $x \in \mathbb{R}^{2n}$. By definition, $f_1(x) = \sum_{S \subseteq [n]} c_S \cdot \chi_S(x)$ and $f_2(x) = \sum_{S \subseteq [n+1, 2n]} c_S \cdot \chi_S(x)$. We can write

$$f(x) = f_1(x) + f_2(x) + f_{12}(x)$$

where $f_{12}(x) = \sum_{S \subseteq [2n],\ S \cap [n] \neq \emptyset,\ S \cap [n+1, 2n] \neq \emptyset} c_S \cdot \chi_S(x)$.

Let us start by giving a very brief overview of the argument. The
proof proceeds by carefully analyzing the structure of the
coefficients $c_S$ for the subfunctions $f_1, f_2, f_{12}$. In
particular, we show that the total weight of the cross terms (i.e.
$\mathrm{wt}(f_{12})$) is negligible, and that the weight of $f$ is
roughly equally spread among $f_1$ and $f_2$. Moreover, the
coefficients of $f_1, f_2$ are either themselves negligible or
"matching" (see inequalities (i)-(iv) below). Once these facts have
been established, it is not hard to complete the proof.

The main step towards achieving this goal is to relate the
coefficients $c_S$ to the coefficients of an appropriately chosen
restriction of $f$, obtained by carefully choosing a value of
$a \in \{0,1\}^n$. We start with the following crucial claim:

Claim 4.3. Suppose $f$ passes the test with probability at least
$1/2 + \beta$. Then there exists $a' \in \{0,1\}^n$ such that

$\|f_{a'}\|_2 \le 2^{-n} \cdot \log^{d^2} n.$

Proof of Claim 4.3. Let us start by giving an equivalent description
of the test. Denote $\omega = (1_n, 0_n) \in \mathbb{R}^{2n}$ and
$r = (r_i)_{i=1}^{2n}$ with $r_i = a_i h_i + g_i^d$ and $r_{n+i} = g_i$,
$i \in [n]$. Note that $y = r + (b\delta)\omega$. Then the Dictator Test
is as follows:

• Generate $r$, and with probability 1/2, test whether
$f(r + \delta\omega) \ge 0$; otherwise test $f(r - \delta\omega) < 0$.

Hence, since $f$ passes with probability $1/2 + \beta$, with
probability at least $2\beta$ over the choice of $r$, the following
inequalities are simultaneously satisfied:

$f(r + \delta\omega) \ge 0; \quad f(r - \delta\omega) < 0.$
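The equivalent description of the test can be simulated directly. The following is a minimal sketch, not the authors' code: it assumes $\delta = 2^{-n^2}$, uses illustrative values of $n$, $d$, $\beta$, and relies on exact rational arithmetic (`Fraction`) because $\delta$ is far below floating-point precision relative to $g_i^d$. The names `sample_r`, `passes` and `pass_rate` are hypothetical helpers.

```python
import random
from fractions import Fraction

def sample_r(n, d, beta, rng):
    """Draw r with r_i = a_i*h_i + g_i^d and r_{n+i} = g_i, as exact rationals."""
    g = [Fraction(rng.gauss(0, 1)) for _ in range(n)]
    h = [Fraction(rng.gauss(0, 1)) for _ in range(n)]
    a = [1 if rng.random() < beta else 0 for _ in range(n)]
    return [a[i] * h[i] + g[i] ** d for i in range(n)] + g

def passes(f, r, n, delta, rng):
    """One run: test f(r + delta*w) >= 0 or f(r - delta*w) < 0, each w.p. 1/2."""
    b = rng.choice([1, -1])
    y = [r[i] + b * delta for i in range(n)] + r[n:]   # w = (1_n, 0_n)
    return f(y) >= 0 if b == 1 else f(y) < 0

def pass_rate(f, n, d, beta, delta, trials, seed=0):
    """Empirical acceptance probability of f over independent runs."""
    rng = random.Random(seed)
    hits = sum(passes(f, sample_r(n, d, beta, rng), n, delta, rng)
               for _ in range(trials))
    return hits / trials

n, d, beta = 4, 3, 0.25                 # illustrative values only
delta = Fraction(1, 2 ** (n * n))       # assumed delta = 2^(-n^2)
dictator = lambda y: y[0] - y[n] ** d   # f(x) = x_1 - x_{n+1}^d
rate = pass_rate(dictator, n, d, beta, delta, trials=2000)
```

For the dictator polynomial, $f(y) = a_1 h_1 + b\delta$ exactly, so every run with $a_1 = 0$ is accepted and `rate` should come out at or above $1 - \beta$.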

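We now upper bound $|f(r + \delta\omega) - f(r)|$:

$|f(r + \delta\omega) - f(r)|$

$= \left| \sum_{|S| \le d} c_S \cdot \left( \prod_{i \in S \cap [n]} (r_i + \delta) \cdot \prod_{j \in S \cap [n+1,2n]} r_j - \prod_{i \in S} r_i \right) \right|$

$\le \sum_{1 \le |S| \le d} |c_S| \cdot \left( \sum_{\emptyset \ne T \subseteq S \cap [n]} \delta^{|T|} \cdot \prod_{i \in S \setminus T} |r_i| \right)$

$\le \sum_{1 \le |S| \le d} |c_S| \cdot 2^{|S|} \cdot \left( \delta \cdot \prod_{i \in S : |r_i| \ge 1} |r_i| \right).$

The last inequality follows from the fact that there are at most
$2^{|S|}$ terms in the second summation, each bounded from above by
$\delta \cdot \prod_{i \in S : |r_i| \ge 1} |r_i|$.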

We now claim that with probability at least $1 - n^{-1}$ over the
choice of $r$ it holds that $M := \max_{i \in [2n]} |r_i| \le \log^d n$.
To see this, note that if $\max_{i \in [n]} \{|g_i|, |h_i|\} \le c$
then $M \le 2c^d$. Now recall that for $g \sim N(0,1)$ and $c > 2$ we
have $\Pr[|g| > c] \le e^{-c^2/2}$. The claim follows by fixing
$c = \Theta(\log^{1/2} n)$ and taking a union bound over the
corresponding $2n$ events.

Therefore, with probability $1 - n^{-1}$ over the choice of $r$, we have

$|f(r + \delta\omega) - f(r)| \le \delta \cdot 2^d \cdot (\log n)^{d^2} \cdot \mathrm{wt}(f) < 2^{-n}.$

Analogously we obtain that $|f(r) - f(r - \delta\omega)| < 2^{-n}$.
We conclude that with probability $2\beta - n^{-1} \ge \beta$ over $r$,

(4.1) $|f(r)| < 2^{-n}.$

Recall that $r$ is a random vector that depends on $a, g, h$. For
every realization of $a \in \{0,1\}^n$, we denote the corresponding
restriction of $f$ as $f_a(g, h)$; note that $f_a(g, h)$ is a degree
$d^2$ real polynomial over Gaussian random variables. Let us denote
$\|f_a\|_2 := \mathbf{E}_{g,h}[f_a(g,h)^2]^{1/2}$.

At this point we appeal to an analytic fact from [5]: low degree
polynomials over independent Gaussian inputs have good
anti-concentration. In particular, an application of Theorem A.2 for
$f_a(g, h)$ yields that for all $a \in \{0,1\}^n$ it holds that

$\Pr_{g,h}[|f_a(g,h)| \le 2^{-n}] \le d^2 \cdot (2^{-n} / \|f_a\|_2)^{1/d^2}.$

Combined with (4.1) this gives

$\beta \le \Pr_{a,g,h}[|f_a(g,h)| \le 1/2^n] \le \mathbf{E}_a\left[ d^2 \cdot (2^{-n} / \|f_a\|_2)^{1/d^2} \right].$

Now let us fix $a' := \arg\min_{a \in \{0,1\}^n} \|f_a\|_2$; the above
relation implies $(2^{-n} / \|f_{a'}\|_2)^{1/d^2} \ge \beta$, or
$\|f_{a'}\|_2 \le 2^{-n} (1/\beta)^{d^2}$, as desired. This completes the
proof of Claim 4.3. □

Since $a'$ is fixed, we can express $f_{a'}$ as a degree-$d^2$
polynomial over the $g_i$'s and $h_i$'s. Let us write

$f_{a'} = \sum_{T,T'} w_{T,T'} \cdot \prod_{i \in T} g_i \cdot \prod_{i \in T'} h_i$

where $T, T' \subseteq [n]$ are multi-sets satisfying $|T| + |T'| \le d^2$
and $w_{T,T'} =$ […]. Since $f_{a'}$ has small variance, intuitively
each of its coefficients should also be small. The following simple
fact establishes such a relationship:

Fact 4.4. Let $f : \mathbb{R}^l \to \mathbb{R}$ be a degree-$d$ polynomial
$f(x) = \sum_{S \subseteq [l], |S| \le d} c_S \cdot \chi_S(x)$ and
$\mathcal{G} \sim N(0,1)^l$. For all $T \subseteq [l]$ we have
$\|f(\mathcal{G})\|_2 \ge d^{-d} \cdot |c_T| / \binom{l+d}{d}$.

Proof of Fact 4.4. The fact follows by expressing $f$ in an
appropriate orthonormal basis. Let $\{H_S\}_{S \subseteq [l], |S| \le d}$
be the set of Hermite polynomials of degree at most $d$ over $l$
variables, and let $f(x) = \sum_{|S| \le d} \hat{f}(S) H_S(x)$ be the
Hermite expansion of $f$. Then
$\|f(\mathcal{G})\|_2^2 = \sum_S \hat{f}(S)^2$, which clearly implies that

$\|f(\mathcal{G})\|_2 \ge \max_S |\hat{f}(S)|.$

Fix an $S \subseteq [l]$ with $|S| \le d$. By basic properties of the
Hermite polynomials (see e.g. [13]) we have that
$H_S(x) = \sum_{U \subseteq S} h^S_U \cdot \chi_U(x)$ with $|h^S_U| \le d^d$.
Hence, for a fixed $T \subseteq [l]$, $c_T$ can be written as
$\sum_{S \supseteq T} h^S_T \hat{f}(S)$. Since $S \subseteq [l]$ and
$|S| \le d$, there are at most $\binom{l+d}{d}$ terms in the summation.
Therefore, there must exist some $S$ such that
$|\hat{f}(S)| \ge d^{-d} \cdot |c_T| / \binom{l+d}{d}$. This completes the
proof. □

Notation: For the remainder of this proof we will be interested in
the coefficients $w_{T,T'}$ for $T' = \emptyset$. For notational
convenience we shall denote $w_T := w_{T,\emptyset}$.
We now claim that for all $T$ we have

(4.2) $|w_T| \le n^{-10d}.$

Using Fact 4.4, if this were not the case we would get a
contradiction with Claim 4.3.

At this point we establish the relationship between the $w_T$'s and
the coefficients $c_S$ of $f$ in our original basis $\{\chi_S\}$.

By definition, the restriction obtained from $f_{a'}(g, h)$ by setting
the $h_i$ variables to 0 is identical to the function
$f(g_1^d, \ldots, g_n^d, g_1, \ldots, g_n)$. Therefore we have

(4.3) $\sum_{T \subseteq [n]} w_T \cdot \prod_{i \in T} g_i = \sum_{S \subseteq [2n]} c_S \cdot \prod_{i \in S \cap [n]} g_i^d \cdot \prod_{(n+i) \in S} g_i.$

For any fixed $T$ in the LHS of (4.3) there is an equivalence class of
sets $S$ in the RHS such that the monomial
$\prod_{i \in S \cap [n]} g_i^d \cdot \prod_{(n+i) \in S} g_i$ equals
$\prod_{i \in T} g_i$. It is clear that $w_T$ equals $\sum_S c_S$, where
the sum is over all $S$ in the equivalence class. In fact, the
structure of the equivalence classes is quite simple, as established
by the following claim:


Claim 4.5. For any $S_0 \ne S_1 \subseteq [2n]$ of size at most $d$, if

(4.4) $\prod_{i \in S_0 \cap [n]} g_i^d \cdot \prod_{n+j \in S_0,\, j \in [n]} g_j = \prod_{i \in S_1 \cap [n]} g_i^d \cdot \prod_{n+j \in S_1,\, j \in [n]} g_j,$

then there exists some $\ell \in [n]$ such that $S_0 = \{\ell\}$ and
$S_1 = \{n + \ell : d\}$ (the multi-set consisting of $d$ copies of
$n + \ell$), or vice versa.

Proof of Claim 4.5. Consider the following two complementary cases.

• $S_0 \cap [n] \ne S_1 \cap [n]$. Without loss of generality, we can
assume that there is some $\ell \in S_0 \cap [n]$ with $\ell \notin S_1$.
(Otherwise the roles of $S_0, S_1$ can be reversed.) Then to make (4.4)
hold, it must be the case that $S_1$ contains $d$ copies of $n + \ell$.
Now, since $|S_1| \le d$, it can only be the case that
$S_1 = \{n + \ell : d\}$, which implies that $S_0 = \{\ell\}$.

• $S_0 \cap [n+1, 2n] \ne S_1 \cap [n+1, 2n]$. We may assume that there
is some $\ell \in [n]$ such that $(n + \ell) \in S_0$. Then, for (4.4) to
hold, it must be the case that $\ell \in S_1$. Hence, it must be the
case that $S_0 = \{n + \ell : d\}$ (since $g_\ell$ is raised to the $d$th
power in the RHS of (4.4)); this in turn enforces $S_1 = \{\ell\}$. □
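Claim 4.5 is easy to confirm by brute force for small parameters. The sketch below is an illustration rather than part of the proof: it enumerates all nonempty multi-sets $S \subseteq [2n]$ of size at most $d$, maps each to the exponent vector of the monomial it induces in (4.4), and reports the classes containing more than one multi-set. The function names are hypothetical.

```python
from itertools import combinations_with_replacement
from collections import defaultdict

def exponent_vector(S, n, d):
    """Exponents of g_1..g_n in prod_{i in S, i<=n} g_i^d * prod_{n+j in S} g_j."""
    e = [0] * n
    for s in S:                      # S is a sorted tuple over 1..2n (a multi-set)
        if s <= n:
            e[s - 1] += d            # element i <= n contributes g_i^d
        else:
            e[s - n - 1] += 1        # element n+j contributes g_j
    return tuple(e)

def collisions(n, d):
    """Classes of distinct multi-sets (size 1..d) inducing the same monomial."""
    classes = defaultdict(list)
    for size in range(1, d + 1):
        for S in combinations_with_replacement(range(1, 2 * n + 1), size):
            classes[exponent_vector(S, n, d)].append(S)
    return [c for c in classes.values() if len(c) > 1]

def matches_claim(cls, n, d):
    """True iff the class is exactly { {l}, {n+l : d} } for some l in [n]."""
    if len(cls) != 2:
        return False
    a, b = sorted(cls, key=len)
    return len(a) == 1 and a[0] <= n and b == (a[0] + n,) * d
```

For each small $(n, d)$ one would expect exactly $n$ colliding classes, one per $\ell \in [n]$, each of the form described in the claim.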

Claim 4.5 implies the following relation between the coefficients
$c_S$ and $w_T$:

(A) If $T = \{i : d\}$ for some $i \in [n]$, then we have
$w_T = c_{S_1} + c_{S_2}$ with $S_1 = \{i\}$ and $S_2 = \{n + i : d\}$.

(B) If $T$ is not of the above form, then there exists a multi-set
$S \subseteq [2n]$, $|S| \le d$, where $S \ne \{i\}$ and
$S \ne \{n + i : d\}$ for any $i \in [n]$, such that $T$ equals
$\{i : d \mid i \in S\} \cup \{i \mid n + i \in S\}$. In this case, we have
$w_T = c_S$.

We are now ready to establish the desired bounds on the coefficients
of the subfunctions $f_1, f_2, f_{12}$.

(i) For all $S \subseteq [n]$ with $|S| \ge 2$, (4.2) and (B) yield
$|c_S| \le n^{-10d}$.

(ii) For all $S \subseteq [n+1, 2n]$ with $S \ne \{n + i : d\}$ for any
$i \in [n]$, (4.2) and (B) yield $|c_S| \le n^{-10d}$.

(iii) For all $i \in [n]$, by (4.2) and (A) we obtain
$\big|\, |c_{\{i\}}| - |c_{\{n+i:d\}}| \,\big| \le |c_{\{i\}} + c_{\{n+i:d\}}| \le n^{-10d}$.

(iv) For all $S$ such that $S \cap [n] \ne \emptyset$ and
$S \cap [n+1, 2n] \ne \emptyset$, (4.2) and (B) yield $|c_S| \le n^{-10d}$.

Since the coefficients of $f_1, f_2$ are either very small (cases
(i), (ii) above) or matching (case (iii)), we get
$|\mathrm{wt}(f_1) - \mathrm{wt}(f_2)| \le n^{-10d} \cdot \binom{n+d}{d} \le n^{-1}$.
Moreover, since every coefficient of $f_{12}$ is small (case (iv)), we
deduce that $\mathrm{wt}(f_{12}) \le n^{-10d} \cdot \binom{2n+d}{d} \le n^{-1}$.
Recalling that
$\mathrm{wt}(f_1) + \mathrm{wt}(f_2) + \mathrm{wt}(f_{12}) = \mathrm{wt}(f) = 1$,
we get $\mathrm{wt}(f_1) + \mathrm{wt}(f_2) \ge 1 - n^{-1}$. Combining
these bounds, we get that

(4.5) $0.51 \ge \mathrm{wt}(f_1), \mathrm{wt}(f_2) \ge 0.49.$

Now fix an $i \in [n]$ with $(n + i) \in I_1(f_2)$. The above inequality
implies that there must exist some $S \ni (n + i)$ such that
$|c_S| \ge 0.49 / \binom{n+d}{d}$. By (ii), we deduce that it can only be
the case that $S$ equals $\{n + i : d\}$ (as all other coefficients in
$f_2$ are very small). Moreover, (iii) implies that
$|c_{\{i\}}| \ge 0.48 / \binom{n+d}{d}$, hence $i \in I_{0.5}(f_1)$
(recalling that $\mathrm{wt}(f_1) \le 0.51$). So we have
$|I_1(f_2)| \le |I_{0.5}(f_1)|$, and it remains to bound the size of
$I_{0.5}(f_1)$ from above by $\beta^{-2}$.

Suppose (for the sake of contradiction) that
$|I_{0.5}(f_1)| > \beta^{-2}$. Since $\mathrm{wt}(f_1) \ge 0.49$, every
$j \in I_{0.5}(f_1)$ comes from the set $S = \{j\}$ (as all the other
coefficients of $f_1$ are too small). Consider all possible
realizations of $a \in \{0,1\}^n$. With probability
$1 - (1 - \beta)^{|I_{0.5}(f_1)|} \ge 1 - n^{-1}$ over the choice of $a$,
there exists $i \in I_{0.5}(f_1)$ with $a_i = 1$. Fix such an $i$. By the
definition of $I_{0.5}(f_1)$, we must have
$|c_{\{i\}}| \ge 0.5 \cdot 0.49 \cdot \binom{n+d}{d}^{-1} \ge 0.2 \cdot \binom{n+d}{d}^{-1}$.
Hence, there will be a degree-1 monomial in the expansion of $f_a$ as a
polynomial over $g$ and $h$ whose coefficient has absolute value at
least $0.2 \cdot \binom{n+d}{d}^{-1}$.

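The aforementioned and Fact 4.4 (applied to the degree-$d^2$
polynomial $f_a$ over the $2n$ variables $g, h$) imply that with
probability $1 - n^{-1}$ over $a$ it holds that

$\|f_a\|_2 \ge \binom{2n+d^2}{d^2}^{-1} \cdot (d^2)^{-d^2} \cdot 0.2 \cdot \binom{n+d}{d}^{-1} = \Omega(n^{-2d^2}).$

By Theorem A.2 and the fact that $\mathrm{wt}(f) = 1$ we get that
$\Pr_{a,g,h}[|f_a(g,h)| \le 2^{-n}]$ is at most
$n^{-1} + O(d^2 \cdot n^2 \cdot 2^{-n/d^2}) = o(\beta)$, which contradicts
(4.1). This completes the proof of Lemma 4.2. □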

4.2  Hardness  reduction  from  Unique  Games 

With the completeness and soundness lemmas in place, we are ready to
prove Theorem 1.1. The hardness reduction is from a Unique Games
instance $\mathcal{L}(U, V, E, \Pi, k)$ to a distribution of positive and
negative examples. The examples lie in $\mathbb{R}^{(|U|+|V|)k}$ and are
labeled with either $(+1)$ or $(-1)$. Denote $\dim = (|U| + |V|)k$.

For $w \in U \cup V$ and $x \in \mathbb{R}^{\dim}$, we use $x_w^{(i)}$ to
denote the coordinate corresponding to the vertex $w$'s $i$-th label.
We use $x_w$ to indicate the collection of coordinates corresponding
to vertex $w$; i.e., $(x_w^{(1)}, x_w^{(2)}, \ldots, x_w^{(k)})$. For a
function $f : \mathbb{R}^{\dim} \to \mathbb{R}$, we use $f_u$ to denote the
restriction of $f$ obtained by setting all the coordinates except
$x_u$ to 0. Similarly, $f_{u,v}$ denotes the restriction of $f$
obtained by setting all the coordinates except $x_u, x_v$ to 0.

In the reduction that follows, starting from an instance
$\mathcal{L}$ of Unique Games, we construct a distribution $\mathcal{D}$
over labeled examples. Let us denote by $\mathrm{Opt}(\mathcal{D})$ the
agreement of the best degree-$d$ PTF on $\mathcal{D}$; our constructed
distribution has the following properties:

• If $\mathrm{Opt}(\mathcal{L}) = 1 - \eta$, then
$\mathrm{Opt}(\mathcal{D}) \ge 1 - \eta - \frac{1}{\log k}$; and

• If $\mathrm{Opt}(\mathcal{L}) \le 1/k^{\Theta(\eta)}$, then
$\mathrm{Opt}(\mathcal{D}) \le \frac{1}{2} + \frac{2}{\log k}$.

This immediately yields the desired hardness result. We now describe
and analyze our reduction.


Reduction from Unique Games

Input: Unique Games instance $\mathcal{L}(U, V, E, \Pi, k)$.
Set $\beta = \frac{1}{\log k}$ and $\delta = 2^{-k^2}$.

1. Randomly choose an edge $(u, v) \in E$.

2. Set $y_w = 0$ for any $w \in U \cup V$ such that $w \ne u$, $w \ne v$.

3. Generate $k$ i.i.d. bits $a_i \in \{0,1\}$ with $\Pr[a_i = 1] = \beta$,
$2k$ independent standard Gaussians $\{h_i, g_i\}_{i=1}^{k}$, and a
uniform random sign $b \in \{-1, 1\}$.

4. For all $i \in [k]$, set $y_v^{(i)} := g_i$ and
$y_u^{(i)} := a_i h_i + (g_{\pi_e(i)})^d + \delta b$.

5. Output the labeled example $(y, b)$.
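The five steps above can be sketched as a sampler. This is a minimal illustration under stated assumptions, not the authors' construction verbatim: the toy instance (`U`, `V`, `edges`, `labeling`), the value of `beta`, and the helper names are all hypothetical, permutations are stored so that `pi[i]` is the label of $v$ matched to label $i$ of $u$, and exact rationals are used so that the tiny shift $\delta b$ is not lost to rounding. The second helper evaluates the polynomial built from a labeling in the completeness argument.

```python
import random
from fractions import Fraction

def sample_example(edges, vertices, k, d, beta, delta, rng):
    """One labeled example (y, b); y maps (vertex, label index) -> value."""
    u, v, pi = rng.choice(edges)                                # step 1
    y = {(w, i): Fraction(0) for w in vertices for i in range(k)}  # step 2
    a = [1 if rng.random() < beta else 0 for _ in range(k)]     # step 3
    g = [Fraction(rng.gauss(0, 1)) for _ in range(k)]
    h = [Fraction(rng.gauss(0, 1)) for _ in range(k)]
    b = rng.choice([1, -1])
    for i in range(k):                                          # step 4
        y[(v, i)] = g[i]
        y[(u, i)] = a[i] * h[i] + g[pi[i]] ** d + delta * b
    return y, b                                                 # step 5

def labeling_poly(y, labeling, U, V, d):
    """sum_u x_u^(L(u)) - sum_v (x_v^(L(v)))^d for a labeling L."""
    return (sum(y[(u, labeling[u])] for u in U)
            - sum(y[(v, labeling[v])] ** d for v in V))

# Hypothetical toy instance in which the labeling satisfies every edge.
U, V, k, d = ["u0", "u1"], ["v0", "v1"], 3, 2
edges = [("u0", "v0", [1, 2, 0]), ("u1", "v0", [0, 1, 2]), ("u1", "v1", [2, 0, 1])]
labeling = {"u0": 0, "u1": 1, "v0": 1, "v1": 0}
beta, delta = 0.25, Fraction(1, 2 ** (k * k))   # beta is illustrative
rng = random.Random(0)
agree = 0
for _ in range(2000):
    y, b = sample_example(edges, U + V, k, d, beta, delta, rng)
    agree += (labeling_poly(y, labeling, U, V, d) > 0) == (b == 1)
agreement = agree / 2000
```

On a satisfied edge the polynomial evaluates to $a_i h_i + \delta b$ for $i = L(u)$, so the empirical `agreement` should be at least about $1 - \beta$.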

Lemma 4.6 (Completeness). If $\mathrm{Opt}(\mathcal{L}) = 1 - \eta$, then
there is a degree-$d$ PTF that is consistent with a $1 - \eta - \beta$
fraction of the examples.

Proof. Suppose that there is a labeling $L$ that satisfies a
$1 - \eta$ fraction of the edges. Then it is easy to verify that the
degree-$d$ PTF

$\mathrm{sign}\left( \sum_{u \in U} x_u^{(L(u))} - \sum_{v \in V} \big(x_v^{(L(v))}\big)^d \right)$

agrees with a $1 - \eta - \beta$ fraction of the examples. □

Lemma 4.7 (Soundness). If $\mathrm{Opt}(\mathcal{L}) \le 1/k^{\Theta(\eta)}$,
then no degree-$d$ PTF agrees with more than a $1/2 + 2\beta$ fraction of
the examples.

Proof. Suppose (for the sake of contradiction) that some degree-$d$
polynomial $f$ satisfies a $1/2 + 2\beta$ fraction of the examples. Then
by an averaging argument, for a $\beta$ fraction of the edges $(u, v)$
picked in the first step, we have that $f(x)$ agrees with the labeled
example $(y, b)$ with probability $1/2 + \beta$. Let us call these edges
"good".

Fix a "good" edge $e = (u, v)$ and let us assume for notational
convenience that $\pi_e$ is the identity mapping. Essentially, we are
conducting the test $T_d$ for the restriction $f_{u,v}$ with parameter
$n := k$. Since $f_{u,v}$ passes the test with probability
$1/2 + \beta$, Lemma 4.2 implies that we must have
$I_{0.5}(f_u), I_1(f_v) \ne \emptyset$ and
$|I_1(f_v)|, |I_{0.5}(f_u)| \le 1/\beta^2$.

We are now ready to give our randomized labeling strategy (based on
$f$). For every $u \in U$, randomly pick its label from $I_{0.5}(f_u)$,
and for every $v \in V$ randomly pick its label from $I_1(f_v)$. It is
clear that each good edge is satisfied with probability $\beta^2$.
Since at least a $\beta$ fraction of the edges is good, such a labeling
satisfies at least a $\beta^3 = 1/(\log k)^3$ fraction of the edges in
expectation. Hence, there exists a labeling that satisfies such a
fraction of the edges, which contradicts the assumption that
$\mathrm{Opt}(\mathcal{L}) \le 1/k^{\Theta(\eta)}$, for $k$ sufficiently
large. □

4.3 A technical point: Discretizing the Gaussian Distribution

Lemmas 4.6 and 4.7 do not quite suffice to prove Theorem 1.1, because
the reduction described above is not computable in polynomial time.
This is because the distribution $\mathcal{D}$ has infinite support;
recall that for each edge $e$, sampling from the corresponding
distribution $\mathcal{D}_e$ requires generating $2k$ independent
Gaussian random variables $h = (h_1, \ldots, h_k)$,
$g = (g_1, \ldots, g_k)$.

To discretize the reduction we replace $h$ by $h'$ and $g$ by $g'$,
where each of the $2k$ random variables $h'_i, g'_i$ is independently
generated as a sum of $N$ uniform $\{-1, 1\}$ bits divided by
$\sqrt{N}$. In Theorem 4.9 of Section 4.3.1, we argue that for
sufficiently large $N$ (in particular any $N \ge (2k)^{24 d^4}$
suffices), there is a way to couple the distribution of $(g, h)$ with
that of $(g', h')$ such that every degree-$d^2$ polynomial takes the
same sign on $(g, h)$ as on $(g', h')$ except with probability at most
$1/k$. Since every outcome of $a \in \{0,1\}^k$ results in the
polynomial $f_a(g, h)$ being a degree-$d^2$ polynomial, if we replace
$(g, h)$ with $(g', h')$ in the reduction then the discretized
reduction will almost preserve the soundness and completeness
guarantees of Section 4.2, with only a loss of $1/k$: writing
$\mathcal{D}'$ for the discretized distribution, we have

• If $\mathrm{Opt}(\mathcal{L}) \ge 1 - \eta$, then
$\mathrm{Opt}(\mathcal{D}') \ge 1 - \eta - \frac{1}{\log k} - 1/k$; and

• If $\mathrm{Opt}(\mathcal{L}) \le 1/k^{\Theta(\eta)}$, then
$\mathrm{Opt}(\mathcal{D}') \le \frac{1}{2} + \frac{2}{\log k} + 1/k$.

Finally, we observe that the distribution of $(g', h')$ has support of
size $(N + 1)^{2k} \le (2N)^{2k} \le (4k)^{48 d^4 k}$; since the label
size $k$ is regarded as constant in a Unique Games instance, this is a
(large) constant for constant $d$. Thus it is possible to simply
enumerate the entire support of $\mathcal{D}'$ in polynomial time
(since there are $|E|$ distributions $\mathcal{D}_e$, the overall size
of the support of $\mathcal{D}'$ is polynomial in the size of the
Unique Games instance), and consequently there is no need for
randomness: the entire overall reduction is deterministic. Theorem 1.1
now follows by choosing appropriate settings of $\eta$ and $k$ (e.g.,
$\eta = \epsilon/2$ and
k  =  e1/6  suffices). 

Finally, we note that the above remarks imply that Theorem 1.1 holds
not only for constant $d$, but for $d$ as large as
$O((\log n)^{1/4})$: since $k$ is constant, for such $d$ the support
size $(4k)^{48 d^4 k}$ is still polynomial in $n$.
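The support-size chain can be checked with exact integer arithmetic. The sketch below assumes the reading $N = (2k)^{24 d^4}$, which matches the choice $N = n^{24D^2}$ in Theorem 4.9 with $n = 2k$ and $D = d^2$; the function name is hypothetical.

```python
def support_bounds(k, d):
    """Return ((N+1)^(2k), (2N)^(2k), (4k)^(48 d^4 k)) for N = (2k)^(24 d^4)."""
    N = (2 * k) ** (24 * d ** 4)          # assumed number of bits per coordinate
    exact = (N + 1) ** (2 * k)            # N+1 outcomes for each of 2k coordinates
    mid = (2 * N) ** (2 * k)
    upper = (4 * k) ** (48 * d ** 4 * k)
    return exact, mid, upper

exact, mid, upper = support_bounds(k=2, d=1)
```

The last inequality holds because $(2N)^{2k} = 2^{2k} (2k)^{48 d^4 k}$ and $2^{2k} \le 2^{48 d^4 k}$.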

4.3.1  Discretizing  the  Gaussian  distribution 

The following theorem shows that there exists a distribution
$\mathcal{H}_N/\sqrt{N}$ that is point-wise close to a Gaussian
distribution $\mathcal{G}$ with high probability:

Theorem 4.8. There is a probability distribution
$(\mathcal{G}, \mathcal{H}_N)$ on $\mathbb{R}^2$ such that the marginal
distribution $\mathcal{G}$ of the first coordinate follows the standard
$N(0,1)$ Gaussian distribution, and the marginal distribution
$\mathcal{H}_N$ of the second coordinate is distributed as a sum of $N$
random bits, i.e., $\mathcal{H}_N = \sum_{i=1}^{N} b_i$ where each $b_i$
is an independent random bit from $\{-1, 1\}$. In addition,
$\mathcal{H}_N/\sqrt{N}$ and $\mathcal{G}$ are pointwise close in the
following sense:

$\Pr\left[ |\mathcal{G} - \mathcal{H}_N/\sqrt{N}| \le O(N^{-1/4}) \right] \ge 1 - O(N^{-1/4}).$

Proof. Let $\Phi$ be the CDF (cumulative distribution function) of
$\mathcal{H}_N$, and let $\Psi$ be the CDF of $\mathcal{G}$ (the standard
Gaussian distribution).

We couple the random variables $\mathcal{G}, \mathcal{H}_N$ in the
following way: to obtain a draw $(g_0, h_0)$ from the joint
distribution, first we sample $h_0$ from the marginal distribution on
$\mathcal{H}_N$. We know that

$\Pr[\mathcal{H}_N = h_0] = \Phi(h_0) - \Phi(h_0 - 2),$

since if $h_0$ is a feasible outcome of summing $N$ bits then
$h_0 - 2$ is the largest feasible outcome that is less than $h_0$ (if
any feasible outcome less than $h_0$ exists). Then we generate $g_0$ by
drawing random samples from the standard Gaussian distribution until
we obtain a sample that lies in the interval
$(\Psi^{-1}(\Phi(h_0 - 2)), \Psi^{-1}(\Phi(h_0))]$; when we obtain such a
sample, we set $g_0$ to this value.

It is not difficult to see that the random variable $\mathcal{G}$
defined in this way follows the standard Gaussian distribution;
essentially we are using the value of $h_0$ as an indicator of whether
$\mathcal{G}$ is in the interval
$(\Psi^{-1}(\Phi(h_0 - 2)), \Psi^{-1}(\Phi(h_0))]$. We also need to check
that $\Pr[\mathcal{H}_N = h_0]$ is equal to
$\Pr[\mathcal{G} \in (\Psi^{-1}(\Phi(h_0 - 2)), \Psi^{-1}(\Phi(h_0))]]$.
This is true because

$\Pr[\mathcal{H}_N = h_0]$

$= \Pr[\mathcal{H}_N \in (h_0 - 2, h_0]]$

$= \Phi(h_0) - \Phi(h_0 - 2)$

$= \Pr[\mathcal{G} \in (\Psi^{-1}(\Phi(h_0 - 2)), \Psi^{-1}(\Phi(h_0))]].$

With the above coupling of $\mathcal{G}$ and $\mathcal{H}_N$, it remains
to prove that every value in the interval
$(\Psi^{-1}(\Phi(h_0 - 2)), \Psi^{-1}(\Phi(h_0))]$ is close to
$h_0/\sqrt{N}$, with high probability over a random choice of $h_0$ as
described above. It suffices to verify that the following two
inequalities each hold with probability at least $1 - O(N^{-1/4})$:

$\left| \Psi^{-1}(\Phi(h_0)) - \frac{h_0}{\sqrt{N}} \right| \le O(N^{-1/4})$; and

$\left| \Psi^{-1}(\Phi(h_0 - 2)) - \frac{h_0}{\sqrt{N}} \right| \le O(N^{-1/4}).$

We consider the first inequality; the second one is entirely similar.
We show that $\Psi^{-1}(\Phi(h_0)) - \frac{h_0}{\sqrt{N}} \le O(N^{-1/4})$;
the other direction
$\Psi^{-1}(\Phi(h_0)) - \frac{h_0}{\sqrt{N}} \ge -O(N^{-1/4})$ is similar.

By the Berry-Esseen Theorem (Theorem A.1 in Section A), we have that
$|\Phi(h_0) - \Psi(h_0/\sqrt{N})| \le 1/\sqrt{N}$. Therefore, we have that

(4.6) $\Psi^{-1}(\Phi(h_0)) \le \Psi^{-1}\left( \Psi\left( \frac{h_0}{\sqrt{N}} \right) + \frac{1}{\sqrt{N}} \right) = \frac{h_0}{\sqrt{N}} + E_{h_0},$

where the "error term" $E_{h_0}$ is the value for which

$\Psi(h_0/\sqrt{N} + E_{h_0}) - \Psi(h_0/\sqrt{N}) = 1/\sqrt{N}.$

If $|h_0| \le \sqrt{(N \ln N)/2}$, then in an interval of width
$N^{-1/4}$ around $h_0/\sqrt{N}$ the PDF of the standard Gaussian is
everywhere at least $\Omega(N^{-1/4})$; consequently, if
$|h_0| \le \sqrt{(N \ln N)/2}$ then the error term $E_{h_0}$ is at most
$O(N^{-1/4})$ as required. A standard Chernoff bound implies that
$\Pr[|h_0| > \sqrt{(N \ln N)/2}]$ is at most $O(N^{-1/4})$, and the
argument is complete.


□ 
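The coupling in the proof can be sketched with the standard library alone. One substitution is made here and should be read as such: instead of rejection sampling, the sketch draws a uniform point of the CDF interval $(\Phi(h_0 - 2), \Phi(h_0)]$ and applies $\Psi^{-1}$, which yields the same conditional distribution of $g_0$ given $h_0$. The helper names and the choice $N = 100$ are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist

def binom_cdf_table(N):
    """CDF of H_N (sum of N uniform +-1 bits), keyed by outcome h = 2j - N."""
    total, acc, table = 2 ** N, 0, {}
    for j in range(N + 1):               # j = number of +1 bits
        acc += math.comb(N, j)
        table[2 * j - N] = acc / total
    return table

def coupled_draw(N, cdf, rng):
    """Return (g0, h0) with h0 ~ H_N and g0 in the quantile cell of h0."""
    h0 = sum(rng.choice((-1, 1)) for _ in range(N))
    lo = cdf.get(h0 - 2, 0.0)            # Phi(h0 - 2); 0 if h0 is the minimum outcome
    hi = cdf[h0]                         # Phi(h0)
    u = lo + (hi - lo) * rng.random()    # uniform point of the CDF interval
    u = min(max(u, 1e-300), 1 - 1e-16)   # inv_cdf needs an argument strictly in (0, 1)
    return NormalDist().inv_cdf(u), h0

N = 100
rng = random.Random(0)
cdf = binom_cdf_table(N)
draws = [coupled_draw(N, cdf, rng) for _ in range(2000)]
# Fraction of draws with |g0 - h0/sqrt(N)| at most N^(-1/4).
close = sum(abs(g - h / math.sqrt(N)) <= N ** -0.25 for g, h in draws) / len(draws)
```

In this setup `close` should be near 1, and the `g0` values should look like standard Gaussian samples, in line with the statement of Theorem 4.8.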


Now we use the joint distribution constructed in Theorem 4.8 to
discretize the standard $n$-dimensional Gaussian space for low-degree
PTFs.

Theorem 4.9. Fix any constant $D \ge 1$, and let
$f(x_1, \ldots, x_n) = \sum_{|S| \le D} \hat{f}(S) \prod_{i \in S} x_i$ be a
degree-$D$ polynomial over $\mathbb{R}^n$. Let
$(y, z) \in \mathbb{R}^n \times \mathbb{R}^n$ be generated by taking each
pair $(y_i, z_i)$ to be an i.i.d. draw from the distribution
$(\mathcal{G}, \mathcal{H}_N/\sqrt{N})$ of Theorem 4.8, where we take
$N = n^{24D^2}$. Then we have

$\Pr[\mathrm{sign}(f(y)) \ne \mathrm{sign}(f(z))] \le O(1/n).$

Proof. First, we may assume without loss of generality that the
polynomial $f$ is normalized so that $\sum_{S \ne \emptyset} |\hat{f}(S)|$
equals 1. Since there are at most $\binom{n+D}{D}$ coefficients in $f$,
one of these coefficients $\hat{f}(S)$ must satisfy
$|\hat{f}(S)| \ge \binom{n+D}{D}^{-1}$; now Fact 4.4 implies that
$\|f(y)\|_2 \ge D^{-D} \cdot \binom{n+D}{D}^{-2}$.

We have

$\Pr[\mathrm{sign}(f(y)) \ne \mathrm{sign}(f(z))] \le \Pr[|f(y)| \le |f(y) - f(z)|].$

To bound the latter probability by $O(1/n)$, we show that
$|f(y)| \ge n^{-3D^2}$ with probability $1 - O(1/n)$, and that
$|f(z) - f(y)| < n^{-3D^2}$ with probability $1 - O(1/n)$.

The first desired bound, $\Pr[|f(y)| < n^{-3D^2}] \le O(1/n)$, is an
immediate consequence of Theorem A.2.

For the second, we note that by a union bound and Theorem 4.8, with
probability at least $1 - O(n/N^{1/4}) \ge 1 - O(1/n)$ every
$i \in [n]$ satisfies $|y_i - z_i| \le O(N^{-1/4})$. Standard Chernoff
bounds and Gaussian tail bounds give that the probability that any
$|y_i|$ or $|z_i|$ exceeds $n^{1/D}$ is much less than $1/n$. Now,
similar to the calculation used to bound
$|f(r + \delta\omega) - f(r)|$ in the proof of Claim 4.3, when $y$ and
$z$ are $O(N^{-1/4})$-close in each coordinate and each coordinate is
at most $n^{1/D}$, we have that

$|f(y) - f(z)| \le O(N^{-1/4}) \cdot O(n) < n^{-3D^2}.$

This concludes the proof. □

5 Hardness of learning noisy halfspaces with degree 2 PTF hypotheses:
Proof of Theorem 1.2

Similar to Section 4, the proof has two parts; first (Section 5.1) we
construct a dictator test for degree 2 PTFs, and then (Section 5.2) we
compose the dictator test with the Label Cover instance to prove
NP-hardness.

5.1 The Dictator Test

The key gadget in the hardness reduction is a Dictator Test that is
designed to check whether a degree-2 PTF is of the form
$\mathrm{sign}(x_i)$ for some $i \in [n]$. Suppose $f$ is a degree 2
polynomial

$f(x) = \theta + f_1(x) + f_2(x)$, where
$f_1(x) = \sum_{i \in [n]} c_i x_i$ and $f_2(x) = \sum_{1 \le i \le j \le n} c_{ij} x_i x_j$.

Below we give a one-query Dictator Test $T_2$ for
$\mathrm{sign}(f(x))$.


$T_2$: Dictator Test for Degree-2 Polynomials

Input: A degree-2 real polynomial $f : \mathbb{R}^n \to \mathbb{R}$.
Fix $\beta := \frac{1}{\log n}$ and $\delta := 2^{-n}$.

1. Generate independent bits $a_1, a_2, \ldots, a_n \in \{0, 1\}$, each
with expected value $\beta$. Generate $n$ independent $N(0,1)$ Gaussian
variables $g_1, \ldots, g_n$. Set $r = (a_1 g_1, a_2 g_2, \ldots, a_n g_n)$.

2. Generate $t$ by randomly picking a number
$i \in \{1, 2, \ldots, (\log n)^2\}$ and set $t = n^i$. Generate a random
bit $b \in \{-1, 1\}$.

3. Set $\omega \in \mathbb{R}^n$ to be the all-1s vector $(1, \ldots, 1)$
and set $y = t^3 r + b t^2 \delta \omega$.

4. Accept iff $\mathrm{sign}(f(y)) = b$.


We show that $T_2$ has the following completeness and soundness
properties.

Lemma 5.1. (Completeness) For $i \in [n]$, the polynomial
$f(x) = x_i$ passes $T_2$ with probability at least $1 - \beta$.

Proof. If $f(x) = x_i$ for some $i \in [n]$, then as long as $a_i$ is
set to zero in step 1 we have that $f(y) = b \delta t^2$ and $f$ passes
the test. By definition of the test, $a_i$ is 0 with probability
$1 - \beta$. □
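The test $T_2$ and the completeness claim can be illustrated with a short simulation. This is a sketch under assumptions: logarithms are taken base 2 so that $(\log n)^2$ is an integer for $n = 8$, $\beta$ is set to an illustrative constant rather than $1/\log n$, acceptance follows the convention "value $\ge 0$ for $b = +1$, value $< 0$ for $b = -1$", and `run_t2` is a hypothetical helper name.

```python
import math
import random

def run_t2(f, n, beta, delta, rng):
    """One run of the degree-2 dictator test; returns True iff f is accepted."""
    a = [1 if rng.random() < beta else 0 for _ in range(n)]   # step 1
    g = [rng.gauss(0, 1) for _ in range(n)]
    r = [a[j] * g[j] for j in range(n)]
    levels = max(1, round(math.log2(n) ** 2))                  # assumed base-2 log
    t = n ** rng.randint(1, levels)                            # step 2: t = n^i
    b = rng.choice([1, -1])
    y = [t ** 3 * r[j] + b * t ** 2 * delta for j in range(n)] # step 3, w = all-ones
    return (f(y) >= 0) == (b == 1)                             # step 4

n, beta, delta = 8, 0.25, 2.0 ** -8     # illustrative parameters
rng = random.Random(0)
trials = 4000
dictator_rate = sum(run_t2(lambda y: y[0], n, beta, delta, rng)
                    for _ in range(trials)) / trials
```

Whenever $a_1 = 0$ the dictator $f(x) = x_1$ evaluates to $b \delta t^2$ and is accepted, so `dictator_rate` should be at least about $1 - \beta$.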

Lemma 5.2. (Soundness) Let $A$ denote $\sum_{i=1}^{n} c_i$ and let
$I(f)$ be the set $\{i \mid c_i \ge A/n^2\}$. If a degree-2 polynomial
$f$ passes the test with probability at least $1/2 + \beta$, then
$|I(f)| \le 1/\beta^2$ and $A > 0$.

Proof. The proof is by contradiction. Let $f$ be a degree-2
polynomial with $|I(f)| > 1/\beta^2$ or $A \le 0$, and suppose that $f$
passes the test with probability at least $\frac{1}{2} + \beta$.

First we show the following lemma.

Lemma 5.3. $\Pr_r[f_1(r) \in (-\delta A, \delta A)] \le \frac{2}{n}$.

Proof. The inequality obviously holds for $A \le 0$ since the interval
is then empty. Thus we may assume that $A > 0$ and
$|I(f)| > 1/\beta^2$. We know that in step 1, when generating the
bit-vector $a$, with probability at least
$1 - (1 - \beta)^{|I(f)|} \ge 1 - \frac{1}{n}$ at least one of the
coordinates in $I(f)$ has its bit $a_i$ nonzero. Fix any such outcome
for the bit-vector $a$; now considering the random choice of the
Gaussians $g_1, \ldots, g_n$, we have that the resulting $f_1(r)$ is a
Gaussian variable with variance at least $A^2/n^4$ (as one of the
weights is at least $A/n^2$). Using the standard fact that an
$N(\mu, \sigma^2)$ Gaussian random variable puts probability mass at
most $t/\sigma$ on any interval of length $t$, we have that for such an
outcome of the $a$-vector,

$\Pr_g[f_1(r) \in (-\delta A, \delta A)] \le \frac{2 \delta A}{A/n^2} = 2 \delta n^2 \le \frac{1}{n}.$

Now a union bound gives that for at most a $\frac{2}{n}$ fraction of
the $r$ generated, $f_1(r)$ is inside the interval
$(-\delta A, \delta A)$. □

Now we observe that for any outcome $r$, the vectors $r$ and $-r$ are
generated with equal probability. Thus an equivalent test to $T_2$
would be to generate $r, t$ as described by the test and then check a
randomly selected one of the following four inequalities:

(5.7) $f(t^3 r + t^2 \delta \omega) \ge 0$

(5.8) $f(t^3 r - t^2 \delta \omega) < 0$

(5.9) $f(-t^3 r + t^2 \delta \omega) \ge 0$

(5.10) $f(-t^3 r - t^2 \delta \omega) < 0.$

Since $f$ is assumed to pass the test with probability
$\frac{1}{2} + \beta$, an averaging argument gives that for a $\beta/2$
fraction of the possible outcomes of $r$, at least a
$(\frac{1}{2} + \beta/2)$ fraction of all the constraints involving that
$r$ outcome are satisfied. (Note that for any fixed outcome of $r$
there are $4(\log n)^2$ constraints, corresponding to inequalities
(5.7)-(5.10) for each of the $(\log n)^2$ possible values of $t$.) For
this $\beta/2$ fraction of $r$, let us remove those outcomes $r$ such
that $f_1(r) \in (-\delta A, \delta A)$ (recall that this is at most a
$2/n$ fraction of all $r$-outcomes). Recalling that
$\beta = \frac{1}{\log n}$, we know there is at least a $\beta/4$
fraction of $r$-outcomes remaining; we call these "good" $r$'s.

Let us fix a good $r$. By an averaging argument again, for any "good"
$r$, for at least a $\beta/4$ fraction of the possible outcomes of $t$,
at least 3 out of the 4 inequalities that contain $t$ and $r$ are
satisfied. There are 4 different ways of choosing 3 out of the 4
constraints. Without loss of generality, let us assume that for a
$\beta/16$ fraction of the $t$-outcomes, the first, second, and fourth
constraints (5.7), (5.8) and (5.10) are satisfied. That is:

(5.11) $f(t^3 r + t^2 \delta \omega) \ge 0$

(5.12) $f(t^3 r - t^2 \delta \omega) < 0$

(5.13) $f(-t^3 r - t^2 \delta \omega) < 0.$

Let us call these $t$ "good" for the corresponding $r$, and let us
denote the set that contains all the "good" $t$ for a given "good" $r$
by $T_r$. Since the possible choice of $t = n^i$ ranges over all
$i \in [\log^2 n]$, we therefore obtain
$|T_r| \ge (\log n)^2 \cdot \beta/16 = \Theta(\log n)$.

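Since $f(x)$ is a degree 2 polynomial, we can express
$f(r + \delta\omega)$ as:

$f(r + \delta\omega) = \theta + f_1(r) + f_2(r) + \delta \sum_{i=1}^{n} c_i + \delta^2 \sum_{1 \le i \le j \le n} c_{ij} + \delta \sum_{1 \le i \le j \le n} c_{ij}(r_i + r_j).$

Let us denote $B = \sum_{1 \le i \le j \le n} c_{ij}$ and
$f_2'(r) = \sum_{1 \le i \le j \le n} c_{ij}(r_i + r_j)$. We can rewrite
(5.11), (5.12), (5.13) as:

(5.14) $t^3 f_1(r) + t^2 \delta A + t^6 f_2(r) + t^5 \delta f_2'(r) + t^4 \delta^2 B + \theta \ge 0$

(5.15) $t^3 f_1(r) - t^2 \delta A + t^6 f_2(r) - t^5 \delta f_2'(r) + t^4 \delta^2 B + \theta < 0$

(5.16) $t^3 f_1(r) + t^2 \delta A - t^6 f_2(r) - t^5 \delta f_2'(r) - t^4 \delta^2 B - \theta > 0$

Notice that (5.14) and (5.16) yield

$f_1(r) \ge -\delta A/t + |t^3 f_2(r) + \delta t^2 f_2'(r) + \delta^2 t B + \theta/t^3|.$

Since we already know that $f_1(r) \notin (-\delta A, \delta A)$ and $t$
is at least 1, we get that

$f_1(r) \ge \delta A.$

Also for (5.15), we can rewrite it as

$f_1(r) < \delta A/t - (t^3 f_2(r) - \delta t^2 f_2'(r) + \delta^2 t B + \theta/t^3).$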

Let  us  further  simplify  the  notation  by  writing  C 
for $f_2(r)$, $D$ for $\delta f_2(r)$ and $E$ for $\delta^2 B$. Then we may
rewrite the above constraints as follows:

$f_1(r) \ge -\delta A/t + |t^3 C + t^2 D + tE + \theta/t^3|$

and

(5.17)  $\delta A < f_1(r) < \delta A/t - (t^3 C - t^2 D + tE + \theta/t^3).$

Notice that the above (upper and lower) bounds hold
for any $t$ in $T_r$. Therefore, we know that for any
$t_1, t_2 \in T_r$,

$\delta A/t_1 - (t_1^3 C - t_1^2 D + t_1 E + \theta/t_1^3) \ge -\delta A/t_2 + |t_2^3 C + t_2^2 D + t_2 E + \theta/t_2^3|,$

which is equivalent to

(5.18)  $-(t_1^3 C - t_1^2 D + t_1 E + \theta/t_1^3) + \delta A \left( \frac{1}{t_1} + \frac{1}{t_2} \right) \ge |t_2^3 C + t_2^2 D + t_2 E + \theta/t_2^3|.$

Using the fact that $f_1(r) > \delta A$, the inequality
(5.17) gives $-(t_1^3 C - t_1^2 D + t_1 E + \theta/t_1^3) > (1 - 1/t_1)\delta A$,
which may be rewritten as

$\delta A \le \frac{-(t_1^3 C - t_1^2 D + t_1 E + \theta/t_1^3)}{1 - 1/t_1}.$

Combining this with (5.18), we know that for any $t_1, t_2 \in T_r$, we have

$-(t_1^3 C - t_1^2 D + t_1 E + \theta/t_1^3) \left( 1 + \frac{1/t_1 + 1/t_2}{1 - 1/t_1} \right) \ge |t_2^3 C + t_2^2 D + t_2 E + \theta/t_2^3|.$

By definition, $t_i \ge n$ for any $i$, so we have
$\frac{1/t_1 + 1/t_2}{1 - 1/t_1} \le 3/n$. Therefore, for any $t_1, t_2$ in $T_r$, the following
inequality holds:

(5.19)  $\frac{-(t_1^3 C - t_1^2 D + t_1 E + \theta/t_1^3)}{|t_2^3 C + t_2^2 D + t_2 E + \theta/t_2^3|} \ge \frac{1}{1 + \frac{1/t_1 + 1/t_2}{1 - 1/t_1}} \ge 1 - 3/n.$

Note that the denominator of the LHS of (5.19) can
be zero for at most 6 values of $t_2$; we eliminate
any such values from $T_r$, and we still have $|T_r| \ge \Omega(\log n)$.
(Actually, we will only need $|T_r| \ge 5$
for the remainder of the argument to establish the
required contradiction.) Let us pick $t_0 < t_1 < t_2 < t_3 < t_4$
from $T_r$, and let us write $G$ to denote
$-(t_1^3 C - t_1^2 D + t_1 E + \theta/t_1^3)$. We know that

$G \le t_1^3 |C| + t_1^2 |D| + t_1 |E| + |\theta|/t_1^3.$

Also for $t_0, t_2, t_3, t_4$, we write:

(5.20)  $F_0 := t_0^3 C - t_0^2 D + t_0 E + \theta/t_0^3$

(5.21)  $F_2 := t_2^3 C - t_2^2 D + t_2 E + \theta/t_2^3$

(5.22)  $F_3 := t_3^3 C - t_3^2 D + t_3 E + \theta/t_3^3$

(5.23)  $F_4 := t_4^3 C - t_4^2 D + t_4 E + \theta/t_4^3.$

Let $F$ denote $\max_{j=0,2,3,4} |F_j|$. By (5.19) we know
that

(5.24)  $G/F \ge 1 - 3/n.$

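Viewing $C, D, E, \theta$ as unknowns, we may solve
the above linear system consisting of equations
(5.20), (5.21), (5.22), (5.23) using Cramer's rule. We
find that

$C = \frac{\begin{vmatrix} F_0 & -t_0^2 & t_0 & t_0^{-3} \\ F_2 & -t_2^2 & t_2 & t_2^{-3} \\ F_3 & -t_3^2 & t_3 & t_3^{-3} \\ F_4 & -t_4^2 & t_4 & t_4^{-3} \end{vmatrix}}{\begin{vmatrix} t_0^3 & -t_0^2 & t_0 & t_0^{-3} \\ t_2^3 & -t_2^2 & t_2 & t_2^{-3} \\ t_3^3 & -t_3^2 & t_3 & t_3^{-3} \\ t_4^3 & -t_4^2 & t_4 & t_4^{-3} \end{vmatrix}}.$

Since $0 < t_0 < t_2 < t_3 < t_4$ and these values are
at least a factor of $n$ apart from each other, we have
that the absolute value of the denominator
is $\Omega(t_4^3 t_3^2 t_2 t_0^{-3})$, and the
absolute value of the numerator
is at most $O(F t_4^2 t_3 t_0^{-3})$. Thus we have $|C| = O(F/(t_2 t_3 t_4))$.
Similar analysis shows that
$|D| = O(F/(t_2 t_3))$; $|E| = O(F/t_2)$; and $|\theta| = O(F t_0^3)$.

Therefore, we have

$G \le |C| t_1^3 + t_1^2 |D| + t_1 |E| + |\theta|/t_1^3 \le F \cdot O\left( \frac{t_1^3}{t_2 t_3 t_4} + \frac{t_1^2}{t_2 t_3} + \frac{t_1}{t_2} + \frac{t_0^3}{t_1^3} \right).$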
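The Cramer's rule step can be checked numerically on a small instance. The sketch below is only an illustration: the helper names (`det`, `system_rows`, `cramer_solve`) and the concrete values $n = 100$, $t_j = n^j$ are assumptions made for the example, not taken from the paper. It solves the system (5.20) to (5.23) exactly over the rationals and compares the coefficient determinant with the product $t_4^3 t_3^2 t_2 t_0^{-3}$.

```python
from fractions import Fraction
from itertools import permutations


def det(M):
    """Exact determinant by Leibniz expansion (adequate for a 4x4 matrix)."""
    n = len(M)
    total = Fraction(0)
    for perm in permutations(range(n)):
        inversions = sum(
            1 for a in range(n) for b in range(a + 1, n) if perm[a] > perm[b]
        )
        term = Fraction(1)
        for row, col in enumerate(perm):
            term *= M[row][col]
        total += -term if inversions % 2 else term
    return total


def system_rows(ts):
    """Row for scale t: coefficients of (C, D, E, theta) in t^3 C - t^2 D + t E + theta/t^3."""
    return [[t**3, -(t**2), t, 1 / t**3] for t in ts]


def cramer_solve(rows, rhs):
    """Solve rows * unknowns = rhs by Cramer's rule."""
    d = det(rows)
    solution = []
    for col in range(len(rows)):
        # Replace one column of the coefficient matrix by the right-hand side.
        replaced = [
            row[:col] + [rhs[r]] + row[col + 1:] for r, row in enumerate(rows)
        ]
        solution.append(det(replaced) / d)
    return solution
```

For scales that are a factor of $n$ or more apart, the single permutation term $t_4^3 t_3^2 t_2 t_0^{-3}$ dominates the Leibniz expansion of the coefficient determinant, which is the reason the denominator can be bounded from below.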


Recalling that $t_{i+1}/t_i \ge n$ as they are different powers
of $n$, we have that

$G/F \le O(1/n).$

This contradicts (5.24) and concludes the proof of the
soundness lemma, Lemma 5.2. □

5.2  Hardness reduction from Label Cover

Recall that our reduction is from a Label Cover instance
$\mathcal{L}$ specified by $(U, V, E, k, m, \Pi)$. For notational
convenience let us write $F(q)$ to denote the
space of possible labels for vertex $q \in U \cup V$: for
$u \in U$, $F(u)$ denotes $[k]$ and for $v \in V$, $F(v)$ denotes
$[m]$.

We reduce to a learning problem with labeled
examples in $\mathbb{R}^{|U|k + |V|m} \times \{-1, 1\}$. Let dim denote
$|U|k + |V|m$. For $y \in \mathbb{R}^{\mathrm{dim}}$ and $q \in U \cup V$, we write
$y_q$ to denote the vector consisting of all coordinates
that correspond to vertex $q$, i.e. $y_u$ denotes $(y_u^{(i)})_{i \in [k]}$
for $u \in U$ and $y_v$ denotes $(y_v^{(i)})_{i \in [m]}$ for $v \in V$.

We give the reduction from Label Cover to the
learning problem below. The high level idea is that
the Dictator Test $T_2$ is performed on the restricted
function $p_v(y)$ for a random $v \in V$.

Reduction from Label-Cover $\mathcal{L}$
Input: Label Cover instance $(U, V, E, k, m, \Pi)$.

1. Randomly pick a vertex $v \in V$.

2. For each $w \ne v$, $w \in U \cup V$, set $y_w = 0$.

3. Let $a_1, \ldots, a_m$ be independent $\{0, 1\}$ bits each
with $\mathbf{E}[a_i] = \beta$. Let $g_1, \ldots, g_m$ be independent
$N(0, 1)$ Gaussian random variables. Let $i$
be chosen uniformly from $[(\log m)^2]$ and set
$t = m^i$. Let $b$ be a random uniform bit from
$\{-1, 1\}$.

4. Set $r = (a_1 g_1, a_2 g_2, \ldots, a_m g_m)$.

5. Let $\omega \in \mathbb{R}^m$ be $\omega =$ and set

$y_v := t^3 r + b t^2 \delta \omega.$

6. Output the labeled example $(\mathrm{Fold}(y_v), b)$ (we
describe the folding procedure $\mathrm{Fold}(\cdot)$ later).

The learning problem is to find a degree 2 polynomial
$p : \mathbb{R}^{\mathrm{dim}} \to \{-1, 1\}$ such that $\mathrm{sign}(p(y)) = b$
for the largest possible fraction of labeled examples
generated as described above. Let us denote

$p(y) = \theta + \sum_{q \in U \cup V,\, i \in F(q)} c_q^{(i)} y_q^{(i)} + \sum_{q_1, q_2 \in U \cup V,\, i \in F(q_1),\, j \in F(q_2)} c_{(q_1^{(i)}, q_2^{(j)})} y_{q_1}^{(i)} y_{q_2}^{(j)}.$

Notice that in the reduction, when vertex $v$ is
picked we set all the coordinates to zero except $y_v$.
Essentially we are performing the test $T_2$ on the
function

$p_v = \theta + \sum_{i \in [m]} c_v^{(i)} y_v^{(i)} + \sum_{i, j \in [m]} c_{(v^{(i)}, v^{(j)})} y_v^{(i)} y_v^{(j)}$

which is the restriction of $p(y)$ obtained by setting
all the coordinates to zero except those coordinates
corresponding to vertex $v$. The overall fraction of
agreement of $p(y)$ on all examples is the average
probability, over all $v \in V$, that $p_v$ passes $T_2$.
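As a rough illustration of how one unfolded example is drawn in steps 3 to 5 of the reduction, here is a minimal Python sketch. Several things in it are assumptions made for the example only: the function name `sample_example`, the use of the natural logarithm and rounding down for $(\log m)^2$, and the fact that the vector $\omega$ is supplied by the caller rather than fixed here. Folding (step 6) is not performed.

```python
import math
import random


def sample_example(m, beta, delta, omega, rng):
    """Draw (y_v, b, t) as in steps 3-5 of the reduction, before folding.

    omega is a caller-supplied vector of length m (hypothetical stand-in);
    beta and delta are the parameters named in the reduction.
    """
    # Step 3: sparse mask, Gaussians, a random scale t = m^i, and a label b.
    a = [1 if rng.random() < beta else 0 for _ in range(m)]
    g = [rng.gauss(0.0, 1.0) for _ in range(m)]
    num_scales = max(1, int(math.log(m) ** 2))  # assumption: natural log, rounded down
    i = rng.randint(1, num_scales)
    t = float(m) ** i
    b = rng.choice([-1, 1])
    # Step 4: r = (a_1 g_1, ..., a_m g_m).
    r = [ai * gi for ai, gi in zip(a, g)]
    # Step 5: y_v = t^3 r + b t^2 delta omega.
    y_v = [t**3 * rj + b * t**2 * delta * wj for rj, wj in zip(r, omega)]
    return y_v, b, t
```

The label $b$ enters only through the $t^2$ term, so a hypothesis has to read off the sign of a lower-order term in $t$ in the presence of the much larger $t^3 r$ noise term.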

Folding Trick: We use the “folding” technique
that was first introduced in [9, 19]. The trick essentially
amounts to the following: instead of outputting
the labeled example $(y, b)$ in the last step of
the reduction, we output $(\mathrm{Fold}(y), b)$ where $\mathrm{Fold}(y)$
is the projection of $y$ into a subspace $H^{\perp}$ (defined
below). Folding enables us to enforce that $p$ takes
the same value on different points in $\mathbb{R}^{\mathrm{dim}}$ as long as
they project to the same point in $H^{\perp}$.

We define the subspaces $H$, $H^{\perp}$ for our folding as
follows:

Definition 5.4. For every $e = \{u, v\} \in E$, $i \in [k]$,
we define $b(e, i) \in \mathbb{R}^{\mathrm{dim}}$ to be the vector that has 0
at every coordinate except that $b(e, i)_u^{(i)}$ is 1 and for
every $j \in (\pi^e)^{-1}(i)$, $b(e, i)_v^{(j)}$ is $-1$. Let $B$ be the
collection of all such $b(e, i)$, i.e. $B = \{ b(e, i) \mid e = \{u, v\} \in E,\ i \in [k] \}$.
We define $H$ to be $\mathrm{span}(B)$ and
$H^{\perp}$ to be the orthogonal complement of $H$ in $\mathbb{R}^{\mathrm{dim}}$.

We define $\mathrm{Fold}(y)$ to be the projection of $y$ onto
$H^{\perp}$. It is easy to see that the mapping $\mathrm{Fold}(\cdot)$ can be
performed in polynomial time.
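One straightforward way to carry out the projection is Gram-Schmidt orthogonalization of $B$ followed by subtracting the components along $H$. The sketch below is an illustration under assumptions, not the paper's procedure: the function names, the `index` dictionary mapping (vertex, label) pairs to coordinates, and the encoding of each projection $\pi^e$ as a list `proj[(u, v)]` whose $j$-th entry is the label of $u$ that label $j$ of $v$ maps to are all hypothetical. Exact rational arithmetic is used so that equalities can be tested directly.

```python
from fractions import Fraction


def folding_vectors(edges, proj, k, index):
    """Build the vectors b(e, i) of Definition 5.4 for every edge e and i in [k]."""
    vecs = []
    for (u, v) in edges:
        for i in range(k):
            b = [Fraction(0)] * len(index)
            b[index[(u, i)]] = Fraction(1)
            for j, target in enumerate(proj[(u, v)]):
                if target == i:  # j lies in the preimage of i under pi^e
                    b[index[(v, j)]] = Fraction(-1)
            vecs.append(b)
    return vecs


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def orthogonal_basis(vecs):
    """Gram-Schmidt without normalization; returns an orthogonal basis of span(vecs)."""
    basis = []
    for v in vecs:
        w = list(v)
        for q in basis:
            coef = dot(w, q) / dot(q, q)
            w = [wi - coef * qi for wi, qi in zip(w, q)]
        if any(w):  # skip vectors already in the span
            basis.append(w)
    return basis


def fold(y, basis):
    """Project y onto the orthogonal complement of span(basis)."""
    out = [Fraction(c) for c in y]
    for q in basis:
        coef = dot(out, q) / dot(q, q)
        out = [oi - coef * qi for oi, qi in zip(out, q)]
    return out
```

Because `fold` removes every component along $H$, two points that differ by an element of $H$ are mapped to the same output, which is the property the folding trick relies on.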

After the folding procedure, we can further enforce
$p(x)$ to have the property:

For any $h \in H$ and $x \in \mathbb{R}^{\mathrm{dim}}$, $p(x + h) = p(x)$.

We call functions that have the above property
“folded”. In particular for $e = \{u, v\} \in E$, $c \in \mathbb{R}$, and
$i \in [k]$, a folded function $p$ satisfies $p(x + c\, b(e, i)) = p(x)$.
If we view $p(y)$ as a polynomial only on $y_u^{(i)}$ and
$y_v^{(j)}$ for $j \in (\pi^e)^{-1}(i)$, then Lemma 5.7 shows that we
have the following folding property of $p$:

$c_u^{(i)} = \sum_{j \in (\pi^e)^{-1}(i)} c_v^{(j)}.$

If we sum over all possible $i$, this implies for any
edge $\{u, v\}$, we have

$\sum_{i \in [k]} c_u^{(i)} = \sum_{i \in [m]} c_v^{(i)}.$

Now  we  are  ready  to  prove  Theorem  1.2.  We  will 
show  the  following  two  properties  of  the  reduction  to 
complete  the  proof. 

Lemma 5.5 (Completeness). If $\mathrm{Opt}(\mathcal{L}) = 1$, then
there is a folded function $p(x)$ that is consistent with a
$1 - 1/\log m$ fraction of the labeled examples generated
by the reduction.

Lemma 5.6 (Soundness). If $\mathrm{Opt}(\mathcal{L}) < 1/m^{\eta}$, then
there is no folded degree-2 polynomial that is consistent
with a $1/2 + 2/\log m$ fraction of the labeled examples
generated by the reduction.

Combining Lemmas 5.5 and 5.6 and noticing
that $m$ can be an arbitrarily large constant (such
as $e^{1/\epsilon}$), we obtain Theorem 1.2. (A discretization
similar to that of Section 4.3 is also required, and can
be obtained in a routine way by slightly modifying the
parameters of that section’s construction.)


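Proof of Lemma 5.5: Suppose that
$\mathrm{Opt}(\mathcal{L}) = 1$, so there is a labeling $\ell$ satisfying
all the edges. Then consider the following function

$p(x) = \sum_{w \in U \cup V} x_w^{(\ell(w))}.$

For every $v \in V$, the function $p_v$ is a dictator and
passes $T_2$ with probability at least $1 - 1/\log m$ by
Lemma 5.1. Consequently the overall probability
that $p$ passes the test is at least $1 - 1/\log m$. Finally,
it is easy to check that this function $p(x)$ is folded:
shifting $x$ by $c\, b(e, i)$ adds $c$ to $p$ exactly when $\ell(u) = i$ and
subtracts $c$ exactly when $\pi^e(\ell(v)) = i$, and these two events
coincide because $\ell$ satisfies the edge $e$.

□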

Proof of Lemma 5.6: Suppose that there
is some folded degree-2 polynomial $p(x)$ such that
$\mathrm{sign}(p(x))$ agrees with more than a $1/2 + 2/\log m$ fraction
of the examples, i.e., the average passing probability
of $p_v$ on $T_2$ is at least $1/2 + 2/\log m$. We will show that $\mathrm{Opt}(\mathcal{L}) > 1/m^{\eta}$
and thus prove the lemma.

By an averaging argument, we know that for a $1/\log m$
fraction of the vertices $v \in V$, the restricted polynomial
$p_v$ passes the test $T_2$ with probability at least
$1/2 + 1/\log m$; we refer to any such $v$ as a “good” vertex.
We say that an edge is “good” if the $V$-endpoint of
the edge is a good vertex. Since the graph is regular,
we know that at least a $1/\log m$ fraction of all edges are
“good”.
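The averaging step can be spelled out as follows. This is a worked sketch under the assumption that the two thresholds are $1/2 + 2\varepsilon$ and $1/2 + \varepsilon$ with $\varepsilon = 1/\log m$; the symbols $\alpha_v$ (the probability that $p_v$ passes the test) and $\gamma$ (the fraction of good vertices) are introduced here only for the illustration.

```latex
% alpha_v in [0,1]: passing probability of p_v; gamma: fraction of v with alpha_v >= 1/2 + eps.
\begin{align*}
\tfrac12 + 2\varepsilon
  \;\le\; \mathop{\mathbf{E}}_{v \in V}[\alpha_v]
  \;\le\; \gamma \cdot 1 + (1-\gamma)\left(\tfrac12 + \varepsilon\right)
  \;\le\; \gamma + \tfrac12 + \varepsilon
  \quad\Longrightarrow\quad
  \gamma \;\ge\; \varepsilon .
\end{align*}
```

The middle inequality bounds $\alpha_v$ by 1 on good vertices and by $1/2 + \varepsilon$ on the remaining ones.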

For a “good” vertex $v$, let us define $I_v$ to be

$I_v = \{ j \mid j \in [m],\ c_v^{(j)} \ge \sum_{i=1}^{m} c_v^{(i)} / m^2 \}.$

By Lemma 5.2, we have $|I_v| \le (\log m)^2$ and
$\sum_{i \in [m]} c_v^{(i)} > 0$. For every $u \in U$, we define
$J_u = \{ j \mid j \in [k],\ c_u^{(j)} \ge \sum_{i \in [k]} c_u^{(i)} / k \}$.
We note that $J_u$ is not empty as

$\max_j c_u^{(j)} \ge \sum_{i \in [k]} c_u^{(i)} / k.$

We define the following labeling strategy for $\mathcal{L}$.
For $u \in U$, randomly assign it a label from $J_u$; for
$v \in V$, randomly assign it a label from $I_v$ (if $I_v$ is
empty, we assign a random label to $v$).

For every good edge $e = (u, v)$ and any $j \in J_u$,
since $p$ is folded, we have that

$\sum_{i \in (\pi^e)^{-1}(j)} c_v^{(i)} = c_u^{(j)} \ge \sum_{i \in [k]} c_u^{(i)} / k = \sum_{i \in [m]} c_v^{(i)} / k.$

Since $(\pi^e)^{-1}(j)$ contains at most $m$ labels, there is at least one label $i$ in $(\pi^e)^{-1}(j)$ such that
$c_v^{(i)} \ge \sum_{i' \in [m]} c_v^{(i')} / (km) \ge \sum_{i' \in [m]} c_v^{(i')} / m^2$, and this label is
therefore in $I_v$. As noted earlier we have $|I_v| \le (\log m)^2$,
and so by our randomized labeling strategy
there is at least a $1/(\log m)^2$ probability that edge
$\{u, v\}$ is satisfied.

Therefore the above labeling strategy satisfies (in
expectation) at least a $1/(\log m)^2$ fraction of the good
edges and consequently at least a $1/(\log m)^3$ fraction of
all edges. This means that $\mathrm{Opt}(\mathcal{L}) > 1/m^{\eta}$ and the
proof is complete. □

5.2.1  Folding  Lemma 
Lemma 5.7. Let

$f(x) = \theta + \sum_{i=0}^{n} w_i x_i + \sum_{0 \le i \le j \le n} w_{ij} x_i x_j$

be a degree 2 function. Suppose that for every $x \in \mathbb{R}^{n+1}$,
$c \in \mathbb{R}$ we have $f(x + c(1, -1, \ldots, -1)) = f(x)$.
Then $w_0 = \sum_{i=1}^{n} w_i$.

Proof. Expanding the equality
$f(x + c(1, -1, \ldots, -1)) = f(x)$, we get that

$\theta + w_0 (x_0 + c) + \sum_{i=1}^{n} w_i (x_i - c) + w_{00} (x_0 + c)^2 + \sum_{j=1}^{n} w_{0j} (x_0 + c)(x_j - c) + \sum_{1 \le i \le j \le n} w_{ij} (x_i - c)(x_j - c)$
$= \theta + \sum_{i=0}^{n} w_i x_i + \sum_{0 \le i \le j \le n} w_{ij} x_i x_j.$

Since this equation holds for all $c, x$, if we express
the LHS and RHS as polynomials in the variables
$c, x_0, x_1, \ldots, x_n$, the corresponding coefficients must
be the same. If we look at the coefficients of the
degree-1 monomial $c$, we have that $w_0 - \sum_{i=1}^{n} w_i = 0$
and the lemma is proved. □
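The coefficient comparison in the proof can be checked on small examples. The following sketch uses hypothetical helper names (`quadratic`, `linear_coefficient_in_c`) and made-up coefficients: it extracts the coefficient of $c$ in $c \mapsto f(x + c \cdot d)$ at $x = 0$, which equals $w_0 - \sum_{i \ge 1} w_i$ for $d = (1, -1, \ldots, -1)$, and it also confirms that one particular invariant quadratic satisfies the conclusion of the lemma.

```python
from fractions import Fraction


def quadratic(theta, w, W):
    """Return f(x) = theta + sum_i w[i] x_i + sum_{i <= j} W[i][j] x_i x_j."""
    n1 = len(w)

    def f(x):
        lin = sum(w[i] * x[i] for i in range(n1))
        quad = sum(W[i][j] * x[i] * x[j] for i in range(n1) for j in range(i, n1))
        return theta + lin + quad

    return f


def linear_coefficient_in_c(f, x, direction):
    """Coefficient of c in g(c) = f(x + c * direction), where g has degree <= 2.

    For g(c) = a + b c + d c^2 we have (g(1) - g(-1)) / 2 = b.
    """
    plus = f([xi + di for xi, di in zip(x, direction)])
    minus = f([xi - di for xi, di in zip(x, direction)])
    return Fraction(plus - minus, 2)
```

For instance, $f(x) = (2x_0 + x_1 + x_2) + (2x_0 + x_1 + x_2)^2$ is unchanged by shifts along $(1, -1, -1)$, and its linear coefficients satisfy $w_0 = 2 = w_1 + w_2$.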

6  Conclusion 

We  have  established  two  hardness  results  for  proper 
agnostic  learning  of  low-degree  PTFs.  Our  results 
show  that  even  if  there  exist  low-degree  PTFs  that  are 
almost  perfect  hypotheses,  it  is  computationally  hard 
to  find  low-degree  PTF  hypotheses  that  perform  even 
slightly  better  than  random  guessing;  in  this  sense 
our hardness results are rather strong. However, our results
do not rule out the possibility of efficient learning
algorithms when $\epsilon$ is sub-constant, or if unrestricted
hypotheses  may  be  used.  Strengthening  the  hardness 
results  along  these  lines  is  an  important  goal  for 
future  work,  but  may  require  significantly  new  ideas. 

Another  natural  goal  for  future  work  is  the 
following  technical  strengthening  of  our  results:  show 
that for any constant $d$, it is hard to construct a
degree-$d$ PTF that is consistent with a $(\frac{1}{2} + \epsilon)$ fraction
of a given set of labeled examples, even if there exists
a halfspace that is consistent with a $1 - \epsilon$ fraction of
the  data.  Such  a  hardness  result  would  subsume  both 
of  the  results  of  this  paper  as  well  as  much  prior  work, 
and  would  serve  as  strong  evidence  that  agnostically 
learning  halfspaces  under  arbitrary  distributions  is  a 
computationally  hard  problem. 

Appendix 

A  Probability  inequalities 

We  will  use  the  Berry-Esseen  Theorem,  which  is  a 
quantitative  version  of  the  Central  Limit  Theorem: 

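Theorem A.1. (Berry-Esseen Theorem) Let
$x_1, x_2, \ldots, x_n$ be i.i.d. uniform $\{-1, 1\}$-valued
random variables. Let $c_1, \ldots, c_n \in \mathbb{R}$ be such that
$\sum_{i=1}^{n} c_i^2 = 1$ and $\max_i |c_i| \le \tau$. Let $g$ denote a unit
Gaussian variable drawn from $N(0, 1)$. Then for any
$\theta \in \mathbb{R}$, we have

$\left| \Pr\left[ \sum_{i=1}^{n} c_i x_i \le \theta \right] - \Pr[g \le \theta] \right| \le \tau.$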
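For a small number of variables the two probabilities in the theorem can be compared directly. The sketch below is a numerical illustration with assumed names (`gaussian_cdf`, `sign_sum_cdf`, `max_cdf_gap`) and an assumed test case ($n = 16$, all $c_i = 1/4$, so $\tau = 1/4$); it enumerates all sign patterns, so it is only practical for small $n$.

```python
import math
from itertools import product


def gaussian_cdf(theta):
    """Pr[g <= theta] for a standard Gaussian g."""
    return 0.5 * (1.0 + math.erf(theta / math.sqrt(2.0)))


def sign_sum_cdf(c, theta):
    """Exact Pr[sum_i c_i x_i <= theta] for uniform x in {-1, 1}^n, by enumeration."""
    n = len(c)
    hits = sum(
        1
        for x in product((-1, 1), repeat=n)
        if sum(ci * xi for ci, xi in zip(c, x)) <= theta
    )
    return hits / 2**n


def max_cdf_gap(c, thetas):
    """Largest gap between the two CDFs over the given thresholds."""
    return max(abs(sign_sum_cdf(c, th) - gaussian_cdf(th)) for th in thetas)
```

With equal weights the gap is largest near $\theta = 0$, where the discrete sum has an atom of mass $\binom{16}{8}/2^{16} \approx 0.196$ that the Gaussian does not have.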

We  will  also  use  the  following  anti-concentration 
result  for  low-degree  polynomials  over  Gaussian  ran¬ 
dom  variables,  due  to  Carbery  and  Wright: 

Theorem A.2 ([5]). Let $p : \mathbb{R}^n \to \mathbb{R}$ be a nonzero
degree-$d$ polynomial over the reals. Then for all
$\tau > 0$, we have

$\Pr_{x \sim N^n}\left[ |p(x)| \le \tau \|p\|_2 \right] \le O(d \tau^{1/d}).$